I/O Bottlenecks in Rust
November 13, 2019 #Bottlenecks #Buffering #Performance #Rust

In my last post I took a perfectly good program and went about optimizing it. My first attempt was substantially slower, and I progressively found and fixed bottlenecks until it was a little over twice as fast as the original.
Now, several of my friends are fans of Rust, the programming language that obsessively ensures memory safety the way Haskell enforces type safety and immutability. So naturally, when I shared my work on C, they called out — nearly in unison — to mirror the effort in Rust.
You might already know what a zealous bunch the Rustaceans can be. You might also know that I can be notoriously easy to nerdsnipe. Either way... here we are.
The Problem
To recap, our program takes arbitrary binary input on stdin and converts each byte to a series of bitstrings.
% echo | ./sprawl
00001010%
% echo "hello, world" | ./sprawl
01101000011001010110110001101100011011110010110000100000011101110110111101110010011011000110010000001010%
The % isn't part of the program output; it just means that the output ended without a final newline.
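Just to make the mapping concrete: the lone newline that echo emits is byte 0x0A, and its bitstring is 00001010, which is exactly what the first example prints. In Rust terms (a tiny illustration, not one of the programs below):

fn main() {
    // 0x0A is the newline byte; {:08b} renders it as eight ASCII '0'/'1' characters.
    let b: u8 = 0x0A;
    assert_eq!(format!("{:08b}", b), "00001010");
    println!("{:08b}", b);
}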
You can read the previous post to see my C solutions. In this post I'll be tackling the problem while I tinker with Rust!
Take One
My first take at a Rust implementation follows.
// sprawl1.rs

use std::io::{self, Read, Write};

static BYTES: [&[u8]; 256] = [
    b"00000000", b"00000001", b"00000010", b"00000011",
    b"00000100", b"00000101", b"00000110", b"00000111",
    b"00001000", b"00001001", b"00001010", b"00001011",
    // the rest of the bitstrings
    b"11111100", b"11111101", b"11111110", b"11111111",
];

fn main() {
    let stdin = io::stdin();
    let mut stdout = io::stdout();

    for c in stdin.bytes() {
        if let Ok(b) = c {
            let _ = stdout.write(BYTES[b as usize]);
        }
    }
}
You might notice it bears a strong resemblance to the sprawl0.c and sprawl1.c programs from the previous post, mostly because it was written about the same time. BYTES is a pre-generated array of 256 "byte arrays". It goes on to read each byte of input from stdin and write out each corresponding bitstring to stdout. Let's test it with the same bigfile (~500MB) from the last post.
% ./time sprawl1 bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
Wow. Even our slowest, least efficient C implementation ran in ~20 seconds. Rust can do better than this. My code needs some help. Where to start?
- I pregen the bitstring table, so that we can rule that out.
- I read one character at a time, and I confess, I couldn't have told you at the time how bytes() works (there's a sketch of a buffered alternative just after this list). But we know the fastest C implementation didn't do this.
- I write each bitstring out, 8 bytes at a time. We know from the C version that even without flushing the stream, those write calls pile up.
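For reference, the stock way to make a byte-at-a-time loop cheaper is to put a std::io::BufReader between you and stdin, so that bytes() pulls from an in-memory buffer instead of hitting the handle for every single byte. A minimal sketch of that idea (not one of the numbered programs, and not something I timed):

use std::io::{self, BufReader, Read, Write};

fn main() {
    // Sketch only: BufReader batches the underlying reads; bytes() then
    // iterates over the buffered data one u8 at a time.
    let reader = BufReader::new(io::stdin());
    let mut stdout = io::stdout();

    for byte in reader.bytes() {
        if let Ok(b) = byte {
            // Formats each byte directly rather than using a lookup table.
            let _ = write!(stdout, "{:08b}", b);
        }
    }
}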
Take Two
// sprawl2.rs

use std::io::{self, Read, Write};

fn main() {
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    let stdin = io::stdin();
    let bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = stdout.lock();

    for byte in bufin.bytes() {
        if let Ok(b) = byte {
            let _ = bufout.write(&lookup[b as usize]);
        }
    }
}
Yep, sure enough, those file I/O methods weren't buffered. But Rust has an elegant way to get buffered access to the files using the lock() method. Note that bufout is declared mut (mutable) so that we can write to it. This iteration also includes a runtime-generated bitstring table.
% ./time sprawl2 bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
{"./sprawl2": {"usr": 15.23, "sys": 4.53, "total": 19.77, "cpu": "99%"}}
Well, that's a big improvement! I still expected Rust to be able to do better than this, though. We saw in C that even with buffering, tight I/O loops can bring a program to its knees. This brings us to about where sprawl2.c ended up. And there's that same high proportion of user time - almost 80% of the processing time.
Best of Three
Let's see if we can't bulk up our I/O a little more.
// sprawl3.rs

use std::io::{self, Read, Write};

fn main() {
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    let stdin = io::stdin();
    let mut bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = stdout.lock();

    let mut buffer = [0u8; 4096];
    while let Ok(n) = bufin.read(&mut buffer) {
        if n == 0 {
            break;
        }
        for b in &buffer[0..n] {
            let _ = bufout.write(&lookup[*b as usize]);
        }
    }
}
I had a tip from one of my Rustacean friends about the buffered read. I also learned a bit about the Result<T, E> type while playing. I didn't manage to get the writing buffered, but let's see where we're at now.
% ./time sprawl3 bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
{"./sprawl2": {"usr": 15.23, "sys": 4.53, "total": 19.77, "cpu": "99%"}}
{"./sprawl3": {"usr": 8.27, "sys": 4.78, "total": 13.05, "cpu": "99%"}}
Okay, now we're getting somewhere. Still not on par with our fastest C version, though, and still a large portion of time spent in user space. So let's see about buffering that output!
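As an aside, std::io also ships a BufWriter type built for exactly this kind of small-write coalescing. Here's a rough sketch of how it could wrap sprawl3's output handle; I haven't benchmarked this variant, so treat it as a sketch rather than a fifth program:

use std::io::{self, BufWriter, Read, Write};

fn main() {
    // Same table and read loop as sprawl3, but writes go through a BufWriter.
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        lookup[i].copy_from_slice(format!("{:08b}", i).as_bytes());
    }

    let stdin = io::stdin();
    let mut bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = BufWriter::new(stdout.lock());

    let mut buffer = [0u8; 4096];
    while let Ok(n) = bufin.read(&mut buffer) {
        if n == 0 {
            break;
        }
        for b in &buffer[0..n] {
            let _ = bufout.write_all(&lookup[*b as usize]);
        }
    }
    // BufWriter flushes on drop, but flushing explicitly surfaces any errors.
    let _ = bufout.flush();
}

What I actually did instead was roll the batching by hand.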
Back and Fourth
I'm not anywhere near fluent in Rust, so I just took what I know of C++ and applied the same approach we used in sprawl3.c.

1. fill the inbuffer
2. for each byte in inbuffer, write its bitstring to outbuffer
3. increment the inbuffer index by 1, and the outbuffer index by 8
4. goto (2) until inbuffer is exhausted
5. write outbuffer to stdout all at once
6. goto (1) until stdin is exhausted
I wrote what I could in what I thought was Rust, and employed type-error-driven-development — I kept looking up types and traits until I got this to compile.
// sprawl4.rs

use std::io::{self, Read, Write};

fn main() {
    // generate the bitstring lookup table
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    // get buffered handles on stdin and stdout
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut bufin = stdin.lock();
    let mut bufout = stdout.lock();

    // allocate read and write buffers
    let mut inbuffer = [0u8; 4096];
    let mut outbuffer = [0u8; 32768];

    // fill inbuffer
    while let Ok(n) = bufin.read(&mut inbuffer) {
        // quit if we're done reading
        if n == 0 {
            break;
        }

        // write bitstring for each byte in inbuffer
        for (i, b) in inbuffer[0..n].iter().enumerate() {
            for (j, bit) in lookup[*b as usize].iter().enumerate() {
                outbuffer[i * 8 + j] = *bit;
            }
        }

        // dump outbuffer
        let _ = bufout.write(&outbuffer[0..(n * 8)]);
    }
}
Here we go!
# Our most recent Rust implementation...
% ./time sprawl4 bigfile
# ... and our fastest C implementation, for comparison
% ./time sprawlc bigfile
% cat sprawl?.time
{"./sprawl1": {"usr": 28.31, "sys": 4.01, "total": 32.97, "cpu": "98%"}}
{"./sprawl2": {"usr": 15.23, "sys": 4.53, "total": 19.77, "cpu": "99%"}}
{"./sprawl3": {"usr": 8.36, "sys": 4.63, "total": 13.12, "cpu": "99%"}}
{"./sprawl4": {"usr": 0.33, "sys": 1.56, "total": 1.90, "cpu": "99%"}}
{"./sprawlc": {"usr": 1.09, "sys": 1.47, "total": 2.57, "cpu": "100%"}}
If you're surprised by the winning program, or by the margin... you're not alone. My first thought was that I'd made a logic error, but no. Not only was the output file the same size as all the others, but I used the -d feature of my buddy's original sprawl0 to verify:
% ../c/sprawl0 -d < out > bigfile.dupe
% diff -q bigfile bigfile.dupe; echo $?
0
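For the curious, the decode direction is easy enough to sketch in Rust as well, though the C tool did the real verification here: read eight '0'/'1' bytes at a time and fold them back into one output byte. Another untimed sketch:

use std::io::{self, Read, Write};

fn main() {
    let stdin = io::stdin();
    let mut bufin = stdin.lock();
    let stdout = io::stdout();
    let mut bufout = stdout.lock();

    // Read bitstrings in 8-character chunks and rebuild each original byte.
    let mut chunk = [0u8; 8];
    while bufin.read_exact(&mut chunk).is_ok() {
        let byte = chunk.iter().fold(0u8, |acc, &c| (acc << 1) | (c - b'0'));
        let _ = bufout.write(&[byte]);
    }
}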
Conclusions
So there we have it. A buffered, chunked Rust implementation is over 20% faster than my fastest C attempt!
Could someone write a faster Rust version? Probably. Could someone write a C program that outperforms the Rust one? Almost certainly. But I'll leave these as exercises for you, fearless Reader.
You can find all the code for these programs (and my time helper script) in my blogcode repository.
Did you spot any bugs in my code? File a PR or let me know in a comment!
Do you have any questions, suggestions, or criticisms for me? Please let me know below!