I/O Bottlenecks in Rust
November 13, 2019 #Bottlenecks #Buffering #Performance #Rust

In my last post I took a perfectly good program and went about optimizing it. My first attempt was substantially slower, and I progressively found and fixed bottlenecks until it was a little over twice as fast as the original.
Now, several of my friends are fans of Rust, the programming language that obsessively ensures memory safety the way Haskell enforces type safety and immutability. So naturally, when I shared my work on C, they called out — nearly in unison — to mirror the effort in Rust.
You might already know what a zealous bunch the Rustaceans can be. You might also know that I can be notoriously easy to nerdsnipe. Either way... here we are.
The Problem
To recap, our program takes arbitrary binary input on stdin and converts each byte to a series of bitstrings.
% echo | ./sprawl
00001010%
% echo "hello, world" | ./sprawl
01101000011001010110110001101100011011110010110000100000011101110110111101110010011011000110010000001010%
The % isn't part of the program output; it just means that the output ended without a final newline.
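Just to make the mapping concrete: the lone newline that echo emits is byte 0x0A, and its bitstring is 00001010, which is exactly what the first example prints. In Rust terms (a tiny illustration, not one of the programs below):

fn main() {
    // 0x0A is the newline byte; {:08b} renders it as eight ASCII '0'/'1' characters.
    let b: u8 = 0x0A;
    assert_eq!(format!("{:08b}", b), "00001010");
    println!("{:08b}", b);
}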
You can read the previous post to see my C solutions. In this post I'll be tackling the problem while I tinker with Rust!
Take One
My first take at a Rust implementation follows.
// sprawl1.rs

use std::io::{self, Read, Write};

static BYTES: [&[u8]; 256] = [
    b"00000000", b"00000001", b"00000010", b"00000011",
    b"00000100", b"00000101", b"00000110", b"00000111",
    b"00001000", b"00001001", b"00001010", b"00001011",
    // the rest of the bitstrings
    b"11111100", b"11111101", b"11111110", b"11111111",
];

fn main() {
    let stdin = io::stdin();
    let mut stdout = io::stdout();

    for c in stdin.bytes() {
        if let Ok(b) = c {
            let _ = stdout.write(BYTES[b as usize]);
        }
    }
}
You might notice it bears a strong resemblance to the sprawl0.c and sprawl1.c programs from the previous post, mostly because it was written about the same time. BYTES is a pre-generated array of 256 "byte arrays". It goes on to read each byte of input from stdin and write out each corresponding bitstring to stdout. Let's test it with the same bigfile (~500MB) from the last post.
% ./time sprawl1 bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
Wow. Even our slowest, least efficient C implementation ran in ~20 seconds. Rust can do better than this. My code needs some help. Where to start?
- I pregen the bitstring table, so that we can rule that out.
- I read one character at a time, and I confess, I couldn't have told you at the time how bytes() works (there's a sketch of a buffered alternative just after this list). But we know the fastest C implementation didn't do this.
- I write each bitstring out, 8 bytes at a time. We know from the C version that even without flushing the stream, those write calls pile up.
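For reference, the stock way to make a byte-at-a-time loop cheaper is to put a std::io::BufReader between you and stdin, so that bytes() pulls from an in-memory buffer instead of hitting the handle for every single byte. A minimal sketch of that idea (not one of the numbered programs, and not something I timed):

use std::io::{self, BufReader, Read, Write};

fn main() {
    // Sketch only: BufReader batches the underlying reads; bytes() then
    // iterates over the buffered data one u8 at a time.
    let reader = BufReader::new(io::stdin());
    let mut stdout = io::stdout();

    for byte in reader.bytes() {
        if let Ok(b) = byte {
            // Formats each byte directly rather than using a lookup table.
            let _ = write!(stdout, "{:08b}", b);
        }
    }
}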
Take Two
// sprawl2.rs

use std::io::{self, Read, Write};

fn main() {
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    let stdin = io::stdin();
    let bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = stdout.lock();

    for byte in bufin.bytes() {
        if let Ok(b) = byte {
            let _ = bufout.write(&lookup[b as usize]);
        }
    }
}
Yep, sure enough, those file I/O methods weren't buffered. But Rust has an elegant way to get buffered access to the files using the lock() method. Note that bufout is declared mut (mutable) so that we can write to it. This iteration also includes a runtime-generated bitstring table.
% ./time sprawl2 bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
{"./sprawl2": {"usr": 15.23, "sys": 4.53, "total": 19.77, "cpu": "99%"}}
Well, that's a big improvement! I still expected Rust to be able to do better than this, though. We saw in C that even with buffering, tight I/O loops can bring a program to its knees. This brings us to about where sprawl2.c ended up. And there's that same high proportion of user time - almost 80% of the processing time.
Best of Three
Let's see if we can't bulk up our I/O a little more.
// sprawl3.rs

use std::io::{self, Read, Write};

fn main() {
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    let stdin = io::stdin();
    let mut bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = stdout.lock();

    let mut buffer = [0u8; 4096];
    while let Ok(n) = bufin.read(&mut buffer) {
        if n == 0 {
            break;
        }
        for b in &buffer[0..n] {
            let _ = bufout.write(&lookup[*b as usize]);
        }
    }
}
I had a tip from one of my Rustacean friends about the buffered read. I also learned a bit about the Result<T, E> type while playing. I didn't manage to get the writing buffered, but let's see where we're at now.
% ./time sprawl3 bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
{"./sprawl2": {"usr": 15.23, "sys": 4.53, "total": 19.77, "cpu": "99%"}}
{"./sprawl3": {"usr": 8.27, "sys": 4.78, "total": 13.05, "cpu": "99%"}}
Okay, now we're getting somewhere. Still not on par with our fastest C version, though, and still a large portion of time spent in user space. So let's see about buffering that output!
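As an aside, std::io also ships a BufWriter type built for exactly this kind of small-write coalescing. Here's a rough sketch of how it could wrap sprawl3's output handle; I haven't benchmarked this variant, so treat it as a sketch rather than a fifth program:

use std::io::{self, BufWriter, Read, Write};

fn main() {
    // Same table and read loop as sprawl3, but writes go through a BufWriter.
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        lookup[i].copy_from_slice(format!("{:08b}", i).as_bytes());
    }

    let stdin = io::stdin();
    let mut bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = BufWriter::new(stdout.lock());

    let mut buffer = [0u8; 4096];
    while let Ok(n) = bufin.read(&mut buffer) {
        if n == 0 {
            break;
        }
        for b in &buffer[0..n] {
            let _ = bufout.write_all(&lookup[*b as usize]);
        }
    }
    // BufWriter flushes on drop, but flushing explicitly surfaces any errors.
    let _ = bufout.flush();
}

What I actually did instead was roll the batching by hand.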
Back and Fourth
I'm not anywhere near fluent in Rust, so I just took what I know of C++ and applied the same approach we used in sprawl3.c.

1. fill the inbuffer
2. for each byte in inbuffer, write its bitstring to outbuffer
3. increment the inbuffer index by 1, and the outbuffer index by 8
4. goto (2) until inbuffer is exhausted
5. write outbuffer to stdout all at once
6. goto (1) until stdin is exhausted
I wrote what I could in what I thought was Rust, and employed type-error-driven-development — I kept looking up types and traits until I got this to compile.
// sprawl4.rs

use std::io::{self, Read, Write};

fn main() {
    // generate the bitstring lookup table
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    // get buffered handles on stdin and stdout
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut bufin = stdin.lock();
    let mut bufout = stdout.lock();

    // allocate read and write buffers
    let mut inbuffer = [0u8; 4096];
    let mut outbuffer = [0u8; 32768];

    // fill inbuffer
    while let Ok(n) = bufin.read(&mut inbuffer) {
        // quit if we're done reading
        if n == 0 {
            break;
        }

        // write bitstring for each byte in inbuffer
        for (i, b) in inbuffer[0..n].iter().enumerate() {
            for (j, bit) in lookup[*b as usize].iter().enumerate() {
                outbuffer[i * 8 + j] = *bit;
            }
        }

        // dump outbuffer
        let _ = bufout.write(&outbuffer[0..(n * 8)]);
    }
}
Here we go!
# Our most recent Rust implementation...
% ./time sprawl4 bigfile
# ... and our fastest C implementation, for comparison
% ./time sprawlc bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
{"./sprawl2": {"usr": 15.23, "sys": 4.53, "total": 19.77, "cpu": "99%"}}
{"./sprawl3": {"usr": 8.36, "sys": 4.63, "total": 13.12, "cpu": "99%"}}
{"./sprawl4": {"usr": 0.33, "sys": 1.56, "total": 1.90, "cpu": "99%"}}
{"./sprawlc": {"usr": 1.09, "sys": 1.47, "total": 2.57, "cpu": "100%"}}
If you're surprised by the winning program, or by the margin... you're not alone. My first thought was that I'd made a logic error, but no. Not only was the output file the same size as all the others, but I used the -d feature of my buddy's original sprawl0 to verify:
% ../c/sprawl0 -d < out > bigfile.dupe
% diff -q bigfile bigfile.dupe; echo $?
0
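For the curious, the decode direction is easy enough to sketch in Rust as well, though the C tool did the real verification here: read eight '0'/'1' bytes at a time and fold them back into one output byte. Another untimed sketch:

use std::io::{self, Read, Write};

fn main() {
    let stdin = io::stdin();
    let mut bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = stdout.lock();

    // Read bitstrings in 8-character chunks and rebuild each original byte.
    let mut chunk = [0u8; 8];
    while bufin.read_exact(&mut chunk).is_ok() {
        let byte = chunk.iter().fold(0u8, |acc, &c| (acc << 1) | (c - b'0'));
        let _ = bufout.write(&[byte]);
    }
}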
Conclusions
So there we have it. A buffered, chunked Rust implementation is over 20% faster than my fastest C attempt!
Could someone write a faster Rust version? Probably. Could someone write a C program that outperforms the Rust one? Almost certainly. But I'll leave these as exercises for you, fearless Reader.
You can find all the code for these programs (and my time helper script) in my blogcode repository.
Did you spot any bugs in my code? File a PR or let me know in a comment!
Do you have any questions, suggestions, or criticisms for me? Please let me know below!