I/O Bottlenecks in Rust
Tim Hammerquist November 13, 2019 #buffering #performance #rust

In my last post I took a perfectly good program and went about optimizing it. My first attempt was substantially slower, and I progressively found and fixed bottlenecks until it was a little over twice as fast as the original.
Now, several of my friends are fans of Rust, the programming language that obsessively ensures memory safety the way Haskell enforces type safety and immutability. So naturally, when I shared my work in C, they called out, nearly in unison, for me to mirror the effort in Rust.
You might already know what a zealous bunch the Rustaceans can be. You might also know that I can be notoriously easy to nerdsnipe. Either way... here we are.
The Problem
To recap, our program takes arbitrary binary input on stdin and converts each byte to a series of bitstrings. The % isn't part of the program output; it just means that the output ended without a final newline.
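The conversion itself is just the zero-padded binary rendering of each byte. In Rust terms (this snippet is only illustrative; it isn't from any of the sprawl programs):

```rust
fn main() {
    // each input byte becomes its 8-character, zero-padded binary form
    let b = b'A'; // 0x41
    let bits = format!("{:08b}", b);
    assert_eq!(bits, "01000001");
    println!("{}", bits);
}
```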
You can read the previous post to see my C solutions. In this post I'll be tackling the problem while I tinker with Rust!
Take One
My first take at a Rust implementation follows.
// sprawl1.rs
use std::io::{stdin, stdout, Read, Write};

// a pre-generated table of 256 eight-byte bitstrings, one per byte value
// (the 256-element literal is elided here)
static BYTES: [[u8; 8]; 256] = [ /* ... */ ];

fn main() {
    let mut out = stdout();
    // one byte in, one eight-byte bitstring out
    for byte in stdin().bytes() {
        if let Ok(b) = byte {
            let _ = out.write(&BYTES[b as usize]);
        }
    }
}
You might notice it bears a strong resemblance to the sprawl0.c and sprawl1.c programs from the previous post, mostly because it was written about the same time. BYTES is a pre-generated array of 256 "byte arrays".
It goes on to read each byte of input from stdin and write out each corresponding bitstring to stdout. Let's test it with the same bigfile (~500MB) from the last post.
Wow. Even our slowest, least efficient C implementation ran in ~20 seconds. Rust can do better than this. My code needs some help. Where to start?
- I pregen the bitstring table, so we can rule that out.
- I read one character at a time, and I confess, I couldn't have told you at the time how bytes() works. But we know the fastest C implementation didn't do this.
- I write each bitstring out, 8 bytes at a time. We know from the C version that even without flushing the stream, those write calls pile up.
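For the record, bytes() adapts a reader into an iterator that yields one io::Result<u8> per byte, so every byte costs at least a Result check (and, on an unbuffered reader, potentially a system call). A toy illustration, not code from sprawl1.rs:

```rust
use std::io::{Cursor, Read};

fn main() {
    // bytes() turns any Read into an iterator of io::Result<u8>,
    // one Result to inspect per byte of input
    let reader = Cursor::new(b"hi".to_vec());
    let bytes: Vec<u8> = reader.bytes().map(|r| r.unwrap()).collect();
    assert_eq!(bytes, b"hi".to_vec());
    println!("read {} bytes, one Result at a time", bytes.len());
}
```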
Take Two
// sprawl2.rs

use std::io::{stdin, stdout, Read, Write};

fn main() {
    // build the 256-entry bitstring table at runtime
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        let s: String = format!("{:08b}", i);
        lookup[i].copy_from_slice(s.as_bytes());
    }

    let stdin = stdin();
    let bufin = stdin.lock();
    let stdout = stdout();
    let mut bufout = stdout.lock();

    for byte in bufin.bytes() {
        if let Ok(b) = byte {
            let _ = bufout.write(&lookup[b as usize]);
        }
    }
}
Yep, sure enough, those file I/O methods weren't buffered. But Rust has an elegant way to get buffered access to the files using the lock() method. Note that bufout is declared mut (mutable) so that we can write to it. This iteration also includes a runtime-generated bitstring table.
Well, that's a big improvement! I still expected Rust to be able to do better than this, though. We saw in C that even with buffering, tight I/O loops can bring a program to its knees. This brings us to about where sprawl2.c did. And there's that same high proportion of user time - almost 80% of the processing time.
Best of Three
Let's see if we can't bulk up our I/O a little more.
// sprawl3.rs
use std::io::{stdin, stdout, Read, Write};

fn main() {
    // bitstring lookup table, built at runtime as in sprawl2.rs
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        lookup[i].copy_from_slice(format!("{:08b}", i).as_bytes());
    }

    let stdin = stdin();
    let mut bufin = stdin.lock();
    let stdout = stdout();
    let mut bufout = stdout.lock();

    // read a chunk at a time instead of a byte at a time
    let mut inbuf = [0u8; 16 * 1024];
    loop {
        match bufin.read(&mut inbuf) {
            Ok(0) => break,
            Ok(n) => {
                for &b in inbuf[..n].iter() {
                    let _ = bufout.write(&lookup[b as usize]);
                }
            }
            Err(e) => panic!("{}", e),
        }
    }
}
I had a tip from one of my Rustacean friends about the buffered read. I also learned a bit about the Result<T, E> type while playing. I didn't manage to get the writing buffered, but let's see where we're at now.
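As an aside, that Result<T, E> type is everywhere in std::io. Here's a toy example of matching on a read result (my illustration, not code from sprawl3.rs):

```rust
use std::io::{Cursor, Read};

fn main() {
    // every io read hands back a Result: Ok(n) bytes read, or Err(e)
    let mut reader = Cursor::new(b"hello".to_vec());
    let mut buf = [0u8; 3];
    match reader.read(&mut buf) {
        Ok(0) => println!("end of input"),
        Ok(n) => println!("read {} bytes", n),
        Err(e) => eprintln!("read failed: {}", e),
    }
}
```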
Okay, now we're getting somewhere. Still not on par with our fastest C version, though, and still a large portion of time spent in user space. So let's see about buffering that output!
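For the curious: the standard-library tool for buffered output is BufWriter, which batches small writes into larger flushes. A minimal sketch, separate from the sprawl programs (the Vec stands in for stdout):

```rust
use std::io::{BufWriter, Write};

fn main() {
    // BufWriter collects many small writes into fewer large ones
    let sink: Vec<u8> = Vec::new();
    let mut out = BufWriter::new(sink);
    for i in 0u8..4 {
        // each 8-byte write lands in BufWriter's internal buffer
        out.write_all(format!("{:08b}", i).as_bytes()).unwrap();
    }
    // into_inner() flushes any remaining buffered bytes
    let sink = out.into_inner().unwrap();
    assert_eq!(sink.len(), 32);
}
```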
Back and Fourth
I'm not anywhere near fluent in Rust, so I just took what I know of C++ and applied the same approach we used in sprawl3.c:

1. fill the inbuffer
2. for each byte in inbuffer, write its bitstring to outbuffer
3. increment the inbuffer index by 1, and the outbuffer index by 8
4. goto (2) until inbuffer is exhausted
5. write outbuffer to stdout all at once
6. goto (1) until stdin is exhausted
I wrote what I could in what I thought was Rust, and employed type-error-driven development: I kept looking up types and traits until I got this to compile.
// sprawl4.rs
use std::io::{stdin, stdout, Read, Write};

fn main() {
    // bitstring lookup table, built at runtime as before
    let mut lookup = [[0u8; 8]; 256];
    for i in 0..256 {
        lookup[i].copy_from_slice(format!("{:08b}", i).as_bytes());
    }

    let stdin = stdin();
    let mut input = stdin.lock();
    let stdout = stdout();
    let mut output = stdout.lock();

    let mut inbuffer = [0u8; 16 * 1024];
    let mut outbuffer = [0u8; 16 * 1024 * 8];

    // (1) fill inbuffer, until (6) stdin is exhausted
    loop {
        let n = match input.read(&mut inbuffer) {
            Ok(0) => break,
            Ok(n) => n,
            Err(e) => panic!("{}", e),
        };
        // (2)-(4) copy each byte's bitstring into outbuffer,
        // advancing the input index by 1 and the output index by 8
        let mut out = 0;
        for &b in inbuffer[..n].iter() {
            outbuffer[out..out + 8].copy_from_slice(&lookup[b as usize]);
            out += 8;
        }
        // (5) write outbuffer to stdout all at once
        let _ = output.write_all(&outbuffer[..out]);
    }
}
Here we go!
# Our most recent Rust implementation...

# ... and our fastest C implementation, for comparison
If you're surprised by the winning program, or by the margin... you're not alone. My first thought was that I'd made a logic error, but no. Not only was the output file the same size as all the others, but I used the -d feature of my buddy's original sprawl0 to verify that the outputs matched.
Conclusions
So there we have it. A buffered, chunked Rust implementation is over 20% faster than my fastest C attempt!
Could someone write a faster Rust version? Probably. Could someone write a C program that outperforms the Rust one? Almost certainly. But I'll leave these as exercises for you, fearless Reader.
You can find all the code for these programs (and my time helper script) in my blogcode repository.
Did you spot any bugs in my code? File a PR or let me know in a comment!
Do you have any questions, suggestions, or criticisms for me? Please let me know below!