I/O Bottlenecks in C
Tim Hammerquist · November 08, 2019 · #buffering #c #performance

My friend showed me a piece of code he wrote years ago. It takes binary data on `stdin` and converts each octet to a sequence of 8 `0` and `1` characters on `stdout`. It optionally converts "sprawled" data back to its binary form.
Now, it had been a while since I'd written C++, and even longer since vanilla C, so I thought it would be fun to try my hand at implementing this myself.
For the purposes of this post, I'm only dealing with the "sprawl" portion (binary -> ASCII octets), not the inverse. I also made an effort to work from spec, not studying his code beforehand.
Take One
Here's what I came up with:
```c
/* sprawl1.c */
#include <stdio.h>

static const char *BYTES[256] = {
    "00000000", "00000001", "00000010", /* ... */ "11111111"
};

int
main(void)
{
    unsigned char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
        for (size_t i = 0; i < n; i++)
            printf("%s", BYTES[buf[i]]);

    return 0;
}
```
Note that the `BYTES` array is abbreviated. All 256 octets were hard-coded.
I ran some benchmarks on the two versions to see how much speed I'd eked out. Nothing more elaborate than the `time` shell built-in.
Note that I keep both input and output as files on disk. I could have used `/dev/null` and gotten significantly faster times, but it would defeat the purpose of my benchmarks. Throwing away data is cheap and fast; file I/O is the bottleneck I'm trying to streamline!
`sprawl0` is my friend's original version, and it was more than twice as fast as mine! What was I doing wrong? I went back to compare our implementations. Here are some observations:
- He generates the byte -> bitstring table each time; I encode the full, pre-generated lookup table in the source.
- He allocates the table space off the heap at runtime; I keep it taking up space in the data segment.
- He reads one character at a time from `stdin`; I read 4KiB blocks and iterate over the read block one byte at a time.
- He `fwrite`s the bitstrings out, one at a time, to `stdout`; I `printf` the bitstrings out, one at a time.
So I immediately see some problem points:

- `printf` is wastefully interpolating the bitstring into the format string, and then flushing the file buffer.
- The lookup table is taking up needless space in the compiled binary: `256 * 9 => 2304 bytes`.
Wait, I just said I only need 8 bytes per bitstring! Where does that 9 come from? For string literals like `"01010101"`, C always adds a null byte `\0` at the end for you. Helpful, right?
So here's my battle plan:

- By using `fwrite`, I can specify exactly what to write (the bitstring) and exactly how many bytes (always 8), and write without flushing the write buffer.
- By generating the table at runtime, I can also save some space in the binary on disk, and get rid of those 256 rogue nulls in the process.
Take Two
```c
// sprawl2.c
#include <stdio.h>

int
main(void)
{
    // 8 chars per octet, no terminating nulls: 256 * 8 = 2048 bytes
    char BYTES[256][8];
    for (int i = 0; i < 256; i++)
        for (int b = 0; b < 8; b++)
            BYTES[i][b] = ((i >> (7 - b)) & 1) ? '1' : '0';

    unsigned char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
        for (size_t i = 0; i < n; i++)
            fwrite(BYTES[buf[i]], 1, 8, stdout);

    return 0;
}
```
There! Our `BYTES` array is now the minimum of 2048 bytes. I saved 256 bytes! `BYTES` is also allocated (off the stack) at program launch, so the binary is a bit smaller. And you can see in the `fwrite` call that only 8 bytes are written per byte of input, so that terminating null byte is no longer necessary.
So where does that leave us?
This brings `sprawl2`'s performance in line with `sprawl0`, but as good as I felt about saving that disk space and trimming those nulls, these actually had the smallest effect on performance. The vast majority of the performance gain came from eliminating the file flushing by switching from `printf` to `fwrite`!
But I was still disappointed. While `sprawl2` was marginally outperforming `sprawl0`, I was expecting a bigger margin. The switch to `fwrite` only helped so much.
Then I noticed how much user time I was using relative to system time. There really isn't much to my program. Why was the CPU spending so much time in my code?
Looking back at my code, the answer was staring right back at me: I'd streamlined the `fread` call to do bulk reads of 4KiB each, but I still had this really tight loop calling `fwrite` 4096 times for each `fread` call!
We can do better.
Best of Three
Reading in 4KiB at a time saved us a lot of overhead in earlier versions. Let's use the same approach for writing:
```c
/* sprawl3.c */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char BYTES[256][8];
    for (int i = 0; i < 256; i++)
        for (int b = 0; b < 8; b++)
            BYTES[i][b] = ((i >> (7 - b)) & 1) ? '1' : '0';

    unsigned char inbuf[4096];
    char outbuf[sizeof inbuf * 8];   /* 8 output bytes per input byte */
    size_t n;

    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0) {
        for (size_t i = 0; i < n; i++)
            memcpy(outbuf + i * 8, BYTES[inbuf[i]], 8);
        fwrite(outbuf, 1, n * 8, stdout);   /* one write per block */
    }

    return 0;
}
```
I don't claim to write idiomatic C code, but the idea is simple:
1. fill the `inbuf`
2. for each byte in `inbuf`, write its bitstring to `outbuf`
3. increment the `inbuf` index by 1, and the `outbuf` index by 8
4. goto (2) until `inbuf` is exhausted
5. write `outbuf` to disk all at once
6. goto (1) until `stdin` is exhausted
In other words, do as much work as possible in my own program space without system calls, and let the system calls read and write as much of our pre-processed data as possible.
Survey Says
Well that's a little more like it! Even all the extra buffer shuffling I was doing in the last version barely registered on my CPU compared to the I/O time spent waiting for my SSD drive to read and write buffers.
Refactoring the I/O out of our tight `fwrite` loop more than doubled our performance again!
Conclusions
The two major lessons we learned are:
- Flushing output streams is slow! Buffering I/O gives greater throughput.
- Keep I/O out of tight loops! Factor slow I/O operations out of loops where possible.
A few other observations:
- Pre-generated lookup tables take up space on disk, but runtime generation takes CPU time. And neither had a significant impact on runtime!
- We could have used `fwrite` with the null-terminated bitstrings just fine. It would just never have copied the null byte.
There are a number of variables that I did not explore in this article that would make for interesting experimentation. Among these:
- Adjusting the size of `inbuf` and `outbuf`
- Adjusting the `nice(1)` value of the processes to prevent scheduling overhead
- Using `mmap`
- Reading and writing in different threads
- Using advanced profiling tools like `dtrace(1)` to find bottlenecks
Try It Yourself
These benchmarks will vary by machine, operating system, and CPU/device load. Feel free to try this code out for yourself, or send back comments or even PRs with improvements! Check out the code for this post for yourself.
Technical Notes
All code was compiled using the following flags:
CFLAGS =