I/O Bottlenecks in C

Tim Hammerquist · November 08, 2019 · #buffering #c #performance
My friend showed me a piece of code he wrote years ago. It takes binary data on stdin and converts each octet to a sequence of eight `0`/`1` characters on stdout. It optionally converts "sprawled" data back to its binary form.
Now, it had been a while since I'd written C++, and even longer since vanilla C, so I thought it would be fun to try my hand at implementing this myself.
For the purposes of this post, I'm only dealing with the "sprawl" portion (binary -> ASCII octets), not the inverse. I also made an effort to work from spec, not studying his code beforehand.
Here's what I came up with:
```c
/* sprawl1.c -- reconstructed sketch: the original listing was lost,
 * so this is rebuilt from the description later in this post. */
#include <stdio.h>

/* Pre-generated lookup table: one bitstring per octet value. */
static const char *BYTES[256] = {
    "00000000", "00000001", "00000010", /* ... */ "11111111"
};

int main(void) {
    unsigned char inbuf[4096];
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0)
        for (size_t i = 0; i < n; i++)
            printf("%s", BYTES[inbuf[i]]);
    return 0;
}
```

Note that the `BYTES` array is abbreviated above; all 256 octets were spelled out in the actual source.
I ran some benchmarks on the two versions to see how much speed I'd
eked out. Nothing more elaborate than the
time shell built-in.
Note that I keep both input and output as files on disk. I could have
used /dev/null and gotten significantly faster times, but it would
defeat the purpose of my benchmarks. Throwing away data is cheap and
fast; file I/O is the bottleneck I'm trying to streamline!
sprawl0 is my friend's original version and it was more than twice as
fast as mine! What was I doing wrong? I went back to compare our
implementations. Here are some observations:
- He generates the byte -> bitstring table each time; I encode the full, pre-generated lookup table in the source.
- He allocates the table space off the heap at runtime; I keep it taking up space in the data segment.
- He reads one character at a time from
stdin; I read 4KiB blocks and iterate over the read block one byte at a time.
- He `fwrite`s the bitstrings out, one at a time, to stdout; I `printf` the bitstrings out, one at a time.
So I immediately see some problem points:
- `printf` is wastefully interpolating the bitstring into the format string, and then flushing the file buffer.
- The lookup table is taking up needless space in the compiled binary:
256 * 9 => 2304 bytes.
Wait, I just said I only need 8 bytes per bitstring! Where does that 9
come from? For string literals like "01010101", C always adds a null
terminator (`\0`) at the end for you. Helpful, right?
So here's my battle plan:
- By using `fwrite`, I can specify exactly what to write (the bitstring) and exactly how many bytes (always 8), and write without flushing the write buffer.
- By generating the table at runtime, I can also save some space in the binary on disk, and get rid of those 256 rogue nulls in the process.
```c
/* sprawl2.c -- reconstructed sketch: rebuilt from the description
 * in the text, not the original listing. */
#include <stdio.h>

int main(void) {
    /* Generate the lookup table at runtime: 256 entries x 8 chars,
     * with no terminating nulls. */
    char BYTES[256][8];
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++)
            BYTES[b][bit] = ((b >> (7 - bit)) & 1) ? '1' : '0';

    unsigned char inbuf[4096];
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0)
        for (size_t i = 0; i < n; i++)
            fwrite(BYTES[inbuf[i]], 1, 8, stdout);
    return 0;
}
```

The `BYTES` array is now the minimum of 2048 bytes; dropping those terminating nulls saved 256 bytes. `BYTES` is also allocated (off the stack) at program launch, so the binary is a bit smaller. And you can see in the `fwrite` call that only 8 bytes are written per byte of input, so the terminating null byte is no longer necessary.
So where does that leave us?
This brings sprawl2's performance in line with sprawl0, but as good
as I felt about saving that disk space and trimming those nulls, these
actually had the smallest effect on performance.
The vast majority of the performance gained came from eliminating the file flushing by switching from `printf` to `fwrite`.
But I was still disappointed. While sprawl2 was marginally faster than
sprawl0, I was expecting a bigger margin. The bulk reads
only helped so much.
Then I noticed how much user time I was using relative to system time. There really isn't much to my program. Why was the CPU spending so much time in my code?
Looking back at my code, the answer was staring right back at me: I'd
used an `fread` call to do bulk reads of 4KiB each, but I still
had this really tight loop calling `fwrite` up to 4096 times for each
block read.
We can do better.
Best of Three
Reading in 4KiB at a time saved us a lot of overhead in earlier versions. Let's use the same approach for writing:
```c
/* sprawl3.c -- reconstructed sketch: rebuilt from the steps below. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char BYTES[256][8];
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++)
            BYTES[b][bit] = ((b >> (7 - bit)) & 1) ? '1' : '0';

    unsigned char inbuf[4096];
    char outbuf[sizeof inbuf * 8];  /* 8 output bytes per input byte */
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0) {
        char *out = outbuf;
        for (size_t i = 0; i < n; i++, out += 8)
            memcpy(out, BYTES[inbuf[i]], 8);
        fwrite(outbuf, 1, n * 8, stdout);  /* one write per block */
    }
    return 0;
}
```
I don't claim to write idiomatic C code, but the idea is simple:
1. fill `inbuf` with a block from stdin
2. for each byte in `inbuf`, write its bitstring to `outbuf`
3. increment the `inbuf` index by 1, and the `outbuf` index by 8
4. goto (2) until `inbuf` is exhausted
5. write `outbuf` to disk all at once
6. goto (1) until stdin is exhausted
In other words, do as much work as possible in my own program space without system calls, and let the system calls read and write as much of our pre-processed data as possible.
Well that's a little more like it! Even all the extra buffer shuffling I was doing in the last version barely registered on my CPU compared to the I/O time spent waiting for my SSD drive to read and write buffers.
Refactoring the I/O out of our tight `fwrite` loop more than doubled our performance again!
The two major lessons we learned are:
- Flushing output streams is slow! Buffering I/O gives greater throughput.
- Keep I/O out of tight loops! Factor slow I/O operations out of loops where possible.
A few other observations:
- Pre-generated lookup tables take up space on disk, but runtime generation takes CPU time. And neither had a significant impact on runtime!
- We could have used `fwrite` with the null-terminated bitstrings just fine. It would just never have copied the null byte.
There are a number of variables that I did not explore in this article that would make for interesting experimentation. Among these:
- Adjusting the size of the input and output buffers
- Adjusting the `nice`(1) value of the processes to prevent scheduling overhead
- Reading and writing in different threads
- Using advanced profiling tools like `dtrace`(1) to find bottlenecks
Try It Yourself
These benchmarks will vary by machine, operating system, and CPU/device load. Feel free to try this code out for yourself, or send back comments or even PRs with improvements! Check out the code for this post for yourself.
All code was compiled using the following flags: