A Scala Interview - Part 2Tim Hammerquist July 06, 2018 #algorithms #deepdive #interview #scala
In this post we'll be developing the solution that we began designing in the last post. As a refresher, here's the problem we're solving.
Design and implement a method to select the largest
mvalues from a very large list of
Last time we learned that sorting the entire sequence of
doesn't scale. In some cases, the numbers may not even fit in
memory. But with our solution, this won't be a problem.
Mind Your Trees and Queues
Enter the heap (not to be confused with the JVM's heap memory), also known as a priority queue.
A heap will store any number of values, but rather than dispensing values in the order in which they were added (i.e., "first"), heaps dispense the largest (in the case of a Max Heap) or smallest (for Min Heap) values.
Scala's standard library includes an implementation of the Heap in the PriorityQueue type.
scala> val queue = collection.mutable.PriorityQueue(1, 3, 5, 7, 9, 2, 4, 6, 8, 0) queue: collection.mutable.PriorityQueue = PriorityQueue(9, 8, 5, 7, 3, 2, 4, 6, 1, 0) scala> queue.head res0: Int = 9 scala> queue.dequeue res1: Int = 9 scala> queue.head res2: Int = 8 scala> queue.dequeueAll res3: collection.immutable.IndexedSeq = Vector(8, 7, 6, 5, 4, 3, 2, 1, 0)
There are a few things to notice here:
- The order in which values are dequeued is independent of the order of insertion.
- Values are not (necessarily) stored in-order internally.
- Values are returned in order when dequeued.
The heap is far more powerful (and efficient) than manually sorting a list. (See the Appendix for more on this.)
Thus, the heap is our infamous mystery collection. Let's call it
TopQueue Force Assemble!
Now that we have a design for our collection, it's time to start putting the pieces together.
TopQueue will maintain the biggest
m values. Here's
what the core logic looks like:
In the next few subsections we'll design an interface, devise tests for that interface, and (finally) fill in the code to make it all work.
Speak My Language
The essential operations we need from our
- Add one or more elements.
- Query the number of elements.
- When done, collect the contents.
Let's define our interface around these operations:
def add(element: Int): Unit def size: Int def result: Seq[Int]
Don't Test Me, Bro
Sweet. Now that we have an interface, we get to the fun part: testing!
Using the ScalaTest framework, we can write the following basic testcases:
extends FlatSpec with Matchers
TopQueue is using a Min Heap (i.e., a PriorityQueue with its
ordering reversed). We'll see how this works in the next section.
Got to Get Queue into My Life
Now that we've got a basic design and a test suite to specify the behavior we expect, let's write some code.
(val capacity: Int)
That's it. Not much to look at, is it? Three public methods wrapping a private PriorityQueue member. There are two important things happening in this small snippet of code.
First, look at the PriorityQueue instantiation:
private val queue = PriorityQueue.empty(Ordering.reverse)
Note the argument:
Ordering[Int].reverse This tells the
PriorityQueue to invert the comparison of its arguments; rather
than ordering by the largest values, it orders by smallest,
queue into a Min Heap.
Second, notice the
if expression in the
add() method. This is the
scala expansion of our pseudocode from before. But instead of using a
Max Heap to store all the values and extracting only the first
we use a Min Heap to store only the values that are larger than
the smallest in the heap. This gives us the following performance
- Removing an old, smaller value: constant time (C)
- Inserting a new, larger value: logarithmic time (log m)
Given this implementation, we get the following output from our test class:
The Check's in the Scale
Remember our "very big n" from the first post? Let's write a test case to make sure our solution lives up to all the promises I've made!
it should "handle a lot of input values quickly" in
With a timeout of 30 seconds, I've gambled that
TopQueue performs at
least twice as fast as our naïve solution. Here goes...
The test passed, so we've obviously come in under the 30 second timer. And if we look at the scalatest output, we can see that the whole suite of 5 tests ran in just over 1 second. This amounts to approximately a 60x speed increase versus our first solution.
Hey, we didn't even need to increase the heap size to get this major performance win! In fact, I've verified on my machine that this runs with a heap well under 32MB, down from over 3GB. That's almost a 100x space efficiency improvement.
If you want to see how we achieve such a drastic improvement in both time and space, keep reading.
To see how we can give our minimalist collection some Scala flavor and make it more general and useful, check out part 3.
Appendix: Big O Analysis
How did we get such a drastic improvement in both time and space?
You might remember from part 1 that our first solution
had a complexity of about O(n log n). What are the complexities of our
- n = 100,000,000
- m = 100
|Sorting||n log n ⇒||~63.017 s||n log n ⇒||> 3000 MB|
|TopQueue||n log m ⇒||~1.038 s||m ⇒||< 32 MB|
|~60x faster||~99% less space|
The Time Complexity of
TopQueue is n log m. This means that the
time required is proportional to the total number of elements
processed (to read, compare, and optionally store them) multiplied by
the log of the number of elements stored. Because of the
tree-based implementation of the heap, insertion (and removal) are
The Space Complexity of
TopQueue is the big win here. We never need
more space than it takes to store our 100 top values. This is how
we're able to process all 100,000,000 values using less heap memory
than it would take to even store all their raw 32-bit values.
Note that this notation is designed to help us analyze and compare relative complexities, not to give us concrete time/memory estimates.
For example, the log in the table doesn't equate to log2 or loge. It depends on several factors, including the type, size, and depth of the data structure, and the efficiency of the algorithm.
We'll see how we can make our super efficient code even more useful — and easy to use — in part 3.