I/O Bottlenecks in C

Tim Hammerquist · November 08, 2019 · #buffering #c #performance
My friend showed me a piece of code he wrote years ago. It takes binary data on stdin and converts each octet to a sequence of eight `0`/`1` characters on stdout. It optionally converts "sprawled" data back to its binary form.
Now, it had been a while since I'd written C++, and even longer since vanilla C, so I thought it would be fun to try my hand at implementing this myself.
For the purposes of this post, I'm only dealing with the "sprawl" portion (binary -> ASCII octets), not the inverse. I also made an effort to work from spec, not studying his code beforehand.
Here's what I came up with:
```c
/* sprawl1.c -- reconstructed sketch: the original listing was lost,
 * so this is rebuilt from the description later in this post. */
#include <stdio.h>

/* Pre-generated lookup table: one bitstring per octet value. */
static const char *BYTES[256] = {
    "00000000", "00000001", "00000010", /* ... */ "11111111"
};

int main(void) {
    unsigned char inbuf[4096];
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0)
        for (size_t i = 0; i < n; i++)
            printf("%s", BYTES[inbuf[i]]);
    return 0;
}
```

Note that the `BYTES` array is abbreviated above; all 256 octets were spelled out in the actual source.
I ran some benchmarks on the two versions to see how much speed I'd
eked out. Nothing more elaborate than the
time shell built-in.
Note that I keep both input and output as files on disk. I could have
used /dev/null and gotten significantly faster times, but it would
defeat the purpose of my benchmarks. Throwing away data is cheap and
fast; file I/O is the bottleneck I'm trying to streamline!
sprawl0 is my friend's original version and it was more than twice as
fast as mine! What was I doing wrong? I went back to compare our
implementations. Here are some observations:
- He generates the byte -> bitstring table each time; I encode the full, pre-generated lookup table in the source.
- He allocates the table space off the heap at runtime; I keep it taking up space in the data segment.
- He reads one character at a time from
stdin; I read 4KiB blocks and iterate over the read block one byte at a time.
- He `fwrite`s the bitstrings out, one at a time, to stdout; I `printf` the bitstrings out, one at a time.
So I immediately see some problem points:
- `printf` is wastefully interpolating the bitstring into the format string, and then flushing the file buffer.
- The lookup table is taking up needless space in the compiled binary:
256 * 9 => 2304 bytes.
Wait, I just said I only need 8 bytes per bitstring! Where does that 9
come from? For string literals like "01010101", C always adds a null
terminator (`\0`) at the end for you. Helpful, right?
So here's my battle plan:
- By using `fwrite`, I can specify exactly what to write (the bitstring) and exactly how many bytes (always 8), and write without flushing the write buffer.
- By generating the table at runtime, I can also save some space in the binary on disk, and get rid of those 256 rogue nulls in the process.
```c
/* sprawl2.c -- reconstructed sketch: rebuilt from the description
 * in the text, not the original listing. */
#include <stdio.h>

int main(void) {
    /* Generate the lookup table at runtime: 256 entries x 8 chars,
     * with no terminating nulls. */
    char BYTES[256][8];
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++)
            BYTES[b][bit] = ((b >> (7 - bit)) & 1) ? '1' : '0';

    unsigned char inbuf[4096];
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0)
        for (size_t i = 0; i < n; i++)
            fwrite(BYTES[inbuf[i]], 1, 8, stdout);
    return 0;
}
```

The `BYTES` array is now the minimum of 2048 bytes; dropping those terminating nulls saved 256 bytes. `BYTES` is also allocated (off the stack) at program launch, so the binary is a bit smaller. And you can see in the `fwrite` call that only 8 bytes are written per byte of input, so the terminating null byte is no longer necessary.
So where does that leave us?
This brings sprawl2's performance in line with sprawl0, but as good
as I felt about saving that disk space and trimming those nulls, these
actually had the smallest effect on performance.
The vast majority of the performance gained came from eliminating the file flushing by switching from `printf` to `fwrite`.
But I was still disappointed. While sprawl2 was marginally faster than
sprawl0, I was expecting a bigger margin. The bulk reads
only helped so much.
Then I noticed how much user time I was using relative to system time. There really isn't much to my program. Why was the CPU spending so much time in my code?
Looking back at my code, the answer was staring right back at me: I'd
used an `fread` call to do bulk reads of 4KiB each, but I still
had this really tight loop calling `fwrite` up to 4096 times for each
block read.
We can do better.
Best of Three
Reading in 4KiB at a time saved us a lot of overhead in earlier versions. Let's use the same approach for writing:
```c
/* sprawl3.c -- reconstructed sketch: rebuilt from the steps below. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char BYTES[256][8];
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++)
            BYTES[b][bit] = ((b >> (7 - bit)) & 1) ? '1' : '0';

    unsigned char inbuf[4096];
    char outbuf[sizeof inbuf * 8];  /* 8 output bytes per input byte */
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0) {
        char *out = outbuf;
        for (size_t i = 0; i < n; i++, out += 8)
            memcpy(out, BYTES[inbuf[i]], 8);
        fwrite(outbuf, 1, n * 8, stdout);  /* one write per block */
    }
    return 0;
}
```
I don't claim to write idiomatic C code, but the idea is simple:
1. fill `inbuf` with a block from stdin
2. for each byte in `inbuf`, write its bitstring to `outbuf`
3. increment the `inbuf` index by 1, and the `outbuf` index by 8
4. goto (2) until `inbuf` is exhausted
5. write `outbuf` to disk all at once
6. goto (1) until stdin is exhausted
In other words, do as much work as possible in my own program space without system calls, and let the system calls read and write as much of our pre-processed data as possible.
Well that's a little more like it! Even all the extra buffer shuffling I was doing in the last version barely registered on my CPU compared to the I/O time spent waiting for my SSD drive to read and write buffers.
Refactoring the I/O out of our tight `fwrite` loop more than doubled our performance again!
The two major lessons we learned are:
- Flushing output streams is slow! Buffering I/O gives greater throughput.
- Keep I/O out of tight loops! Factor slow I/O operations out of loops where possible.
A few other observations:
- Pre-generated lookup tables take up space on disk, but runtime generation takes CPU time. And neither had a significant impact on runtime!
- We could have used `fwrite` with the null-terminated bitstrings just fine. It would just never have copied the null byte.
There are a number of variables that I did not explore in this article that would make for interesting experimentation. Among these:
- Adjusting the size of the input and output buffers
- Adjusting the `nice`(1) value of the processes to prevent scheduling overhead
- Reading and writing in different threads
- Using advanced profiling tools like `dtrace`(1) to find bottlenecks
Try It Yourself
These benchmarks will vary by machine, operating system, and CPU/device load. Feel free to try this code out for yourself, or send back comments or even PRs with improvements! Check out the code for this post for yourself.
All code was compiled using the following flags: