I/O Bottlenecks in Rust
Tim Hammerquist · November 13, 2019
#buffering #performance #rust
In my last post I took a perfectly good program and went about optimizing it. My first attempt was substantially slower, and I progressively found and fixed bottlenecks until it was a little over twice as fast as the original.
Now, several of my friends are fans of Rust, the programming language that obsessively ensures memory safety the way Haskell enforces type safety and immutability. So naturally, when I shared my work in C, they called out — nearly in unison — for me to mirror the effort in Rust.
You might already know what a zealous bunch the Rustaceans can be. You might also know that I can be notoriously easy to nerdsnipe. Either way... here we are.
To recap, our program takes arbitrary binary input on stdin and converts each byte to an 8-character bitstring. The % isn't part of the program output; it just means that the output ended without a final newline.
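To make the transformation concrete, here's a minimal sketch (the `bitstring` helper name is mine, not from any of the sprawl programs) of what each byte becomes:

```rust
// Each input byte maps to its zero-padded, 8-character binary form.
fn bitstring(b: u8) -> String {
    format!("{:08b}", b)
}

fn main() {
    // 'A' is 0x41, i.e. 0b01000001
    println!("{}", bitstring(b'A')); // prints "01000001"
}
```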
You can read the previous post to see my C solutions. In this post I'll be tackling the problem while I tinker with Rust!
My first take at a Rust implementation follows.
// sprawl1.rs

use std::io::{stdin, stdout, Read, Write};

// 256 pre-generated bitstrings, one per byte value (abbreviated here).
static BYTES: [&str; 256] = [
    "00000000", "00000001", "00000010", "00000011",
    // ... the remaining entries ...
    "11111111",
];

fn main() {
    for byte in stdin().bytes() {
        if let Ok(b) = byte {
            let _ = stdout().write(BYTES[b as usize].as_bytes());
        }
    }
}
You might notice it bears a strong resemblance to the sprawl1.c program from the previous post, mostly because it was written at about the same time.
BYTES is a pre-generated array of 256 bitstrings. The program then reads each byte of input from stdin and writes the corresponding bitstring to stdout. Let's test it with the same bigfile (~500MB) from the last post.
[timing output for sprawl1]
Wow. Even our slowest, least efficient C implementation ran in ~20 seconds. Rust can do better than this. My code needs some help. Where to start?
- I pregen the bitstring table, so that we can rule that out.
- I read one character at a time, and I confess, I couldn't have told you at the time how bytes() works. But we know the fastest C implementation didn't do this.
- I write each bitstring out, 8 bytes at a time. We know from the C version that even without flushing the stream, those write calls pile up.
// sprawl2.rs

use std::io::{stdin, stdout, Read, Write};

fn main() {
    // Runtime-generated lookup table: 256 bitstrings of 8 bytes each.
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    let stdin = stdin();
    let bufin = stdin.lock();
    let stdout = stdout();
    let mut bufout = stdout.lock();

    for byte in bufin.bytes() {
        if let Ok(b) = byte {
            let _ = bufout.write(&lookup[b as usize]);
        }
    }
}
Yep, sure enough, those file I/O methods weren't buffered. But Rust has an elegant way to get buffered access to the streams using the lock() method. Note that bufout is declared mut (mutable) so that we can write to it. This iteration also includes a runtime-generated bitstring lookup table.
[timing output for sprawl2]
Well, that's a big improvement! I still expected Rust to be able to do better than this, though. We saw in C that even with buffering, tight I/O loops can bring a program to its knees. This brings us roughly to where sprawl2.c was. And there's that same high proportion of user time: almost 80% of the processing time.
Best of Three
Let's see if we can't bulk up our I/O a little more.
// sprawl3.rs

use std::io::{stdin, stdout, Read, Write};

fn main() {
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        lookup[i].copy_from_slice(format!("{:08b}", i).as_bytes());
    }

    let stdin = stdin();
    let mut bufin = stdin.lock();
    let stdout = stdout();
    let mut bufout = stdout.lock();

    // Read input in bulk chunks; writes still go out 8 bytes at a time.
    let mut inbuffer = [0u8; 16 * 1024];
    loop {
        match bufin.read(&mut inbuffer) {
            Ok(0) => break, // EOF
            Ok(n) => {
                for &b in &inbuffer[..n] {
                    let _ = bufout.write(&lookup[b as usize]);
                }
            }
            Err(_) => break,
        }
    }
}
I had a tip from one of my Rustacean friends about the buffered read. I also learned a bit about the Result<T, E> type while playing. I didn't manage to get the writing buffered, but let's see where we're at now.
[timing output for sprawl3]
Okay, now we're getting somewhere. Still not on par with our fastest C version, though, and still a large portion of time spent in user space. So let's see about buffering that output!
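(An aside: the standard library's usual answer here is std::io::BufWriter, which wraps any writer and coalesces small writes into larger ones. This sketch isn't what I used in these programs; it's just the idiomatic route, for the curious.)

```rust
use std::io::{stdout, BufWriter, Write};

fn main() {
    let stdout = stdout();
    // BufWriter coalesces many small writes into fewer large ones;
    // the internal buffer is flushed when the writer is dropped.
    let mut bufout = BufWriter::new(stdout.lock());
    for b in 0u32..256 {
        let _ = bufout.write_all(format!("{:08b}", b).as_bytes());
    }
}
```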
Back and Fourth
I'm not anywhere near fluent in Rust, so I just took what I know of C++ and applied the same approach we used in the fastest C version:

1. fill the inbuffer from stdin
2. for each byte in inbuffer, write its bitstring to outbuffer
3. increment the inbuffer index by 1, and the outbuffer index by 8
4. goto (2) until inbuffer is exhausted
5. write outbuffer to stdout all at once
6. goto (1) until there's no more input
I wrote what I could in what I thought was Rust, and employed type-error-driven development: I kept looking up types and traits until I got this to compile.
// sprawl4.rs

use std::io::{stdin, stdout, Read, Write};

const INSIZE: usize = 16 * 1024;

fn main() {
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        lookup[i].copy_from_slice(format!("{:08b}", i).as_bytes());
    }

    // Each input byte expands to 8 output bytes.
    let mut inbuffer = [0u8; INSIZE];
    let mut outbuffer = [0u8; INSIZE * 8];

    let stdin = stdin();
    let mut bufin = stdin.lock();
    let stdout = stdout();
    let mut bufout = stdout.lock();

    // (1) fill the inbuffer
    while let Ok(n) = bufin.read(&mut inbuffer) {
        if n == 0 {
            break; // no more input
        }
        // (2)-(4) copy each byte's bitstring into outbuffer
        for i in 0..n {
            let b = inbuffer[i] as usize;
            outbuffer[i * 8..i * 8 + 8].copy_from_slice(&lookup[b]);
        }
        // (5) write the whole outbuffer to stdout at once
        let _ = bufout.write_all(&outbuffer[..n * 8]);
    }
}
Here we go!
# Our most recent Rust implementation...
[timing output for sprawl4]
# ... and our fastest C implementation, for comparison
[timing output for the fastest C version]
If you're surprised by the winning program, or by the margin... you're not alone. My first thought was that I'd made a logic error, but no. Not only was the output file the same size as all the others, but I used the -d feature of my buddy's original sprawl0 to verify that the outputs matched.
So there we have it. A buffered, chunked Rust implementation is over 20% faster than my fastest C attempt!
Could someone write a faster Rust version? Probably. Could someone write a C program that outperforms the Rust one? Almost certainly. But I'll leave these as exercises for you, fearless Reader.
You can find all the code for these programs (and my
script) in my blogcode repository.
Did you spot any bugs in my code? File a PR or let me know in a comment!
Do you have any questions, suggestions, or criticisms for me? Please let me know below!