TopQueue in Scala - Part 1
July 01, 2018 #Algorithms #Data Structures #Deep Dive #Interviews #Scala #TopQueueI was once given the following interview question:
Design and implement a method to select the largest
m
values from a very large list ofn
numbers.
I liked this question because solving it (the way we'll be doing) touches on a good cross section of software design in Scala:
- complexity ("Big O") analysis
- polymorphism (subclassing, traits, and bounds)
- testing!
- ...and a few Scala-flavored ideas:
- companion objects
- implicit values
In this post we'll examine the problem, calling out a few factors that make this question interesting, and propose a design for our solution.
In upcoming posts, we'll follow the design through implementation, functional testing, and add a few bells and whistles.
While the concepts we'll discuss are applicable throughout Computer Science, all of our implementations here will use the Scala programming language.
n to the Great Wide Open
The "simplest" solution is to sort all n
numbers and take the last
m
elements. In Scala, finding the 5 largest numbers in a List of
20 values might look like this:
scala> val rng = util.Random
rng: util.Random.type = scala.util.Random$@3cd9aa64
scala> val (n, m) = (10, 3)
n: Int = 10
m: Int = 3
scala> val randomNums = Vector.fill(n)(rng.nextInt(100))
randomNums: Seq[Int] = Vector(51, 2, 80, 70, 35, 52, 14, 12, 43, 92)
scala> val top_m = randomNums.sorted.takeRight(m)
top_m: Seq[Int] = List(70, 80, 92)
Well, that wasn't so hard, was it? Let's up the game:
scala> val (n, m) = (1000000, 100)
n: Int = 1000000
m: Int = 100
scala> val nums = Vector.fill(n)(rng.nextInt)
nums: scala.collection.immutable.Vector[Int] = Vector(230555202, 453704703, 1552626762, ...)
scala> val topNums = nums.sorted.takeRight(m)
res1: scala.collection.immutable.Vector[Int] = Vector(2147263479, 2147263783, 2147268655, ...)
No sweat! Scala handles that in a little over a second on my laptop.
Attempt #1
How about 100,000,000? It might vary on your own machine, but I received one of the following exceptions every time:
scala> val topNums = nums.sorted.takeRight(m)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
Attempt #2
Attempt #1 used the default heap constraints for the JVM. If I increase the heap size to 2GB, I still run out of memory.
scala> val topNums = nums.sorted.takeRight(m)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:177)
Attempt #3
With 3GB of heap, the JVM ground for over 10 minutes before I finally interrupted it.
scala> val topNums = nums.sorted.takeRight(m)
^C
Attempt #4
If I go up to 4GB --- and wait just over a minute --- I finally get the expected result!
scala> val topNums = nums.sorted.takeRight(m)
Vector(2147479698, 2147479714, 2147479772, 2147479777, ...)
// Elapsed time: 63116ms
Brute Force vs the Big O
But that still took a while. And what went wrong the first couple times?
Well, sorting algorithms are expensive, and there are two major factors at play here.
Space Complexity
Sorting requires additional memory to store the hundreds of thousands of intermediate, temporary lists created during the sorting process. It's these temporary objects that consumed all the available heap memory. Even with a heap of 2GB, these objects couldn't fit in memory at once, so the JVM finally collapsed. This is due to the Space Complexity of our solution.
Time Complexity
Increasing the memory to 4GB gave us enough space for our solution, but 60+ seconds is a pretty noticeable chunk of time. This is the Time Complexity of our solution, as we need to sort every element of our 100,000,000 numbers --- a process that can take approximately 1.8 billion distinct comparisons, possibly many more.
Covering algorithmic complexity in depth is beyond the scope of this post, so if this is still voodoo to you, here are some links that should help you get you started:
- Algorithms, (site, Khan Academy)
- Mastering Algorithms with Perl (book, O'Reilly Media)
- Introduction to Algorithms (book, MIT Press)
What was the question?
But we only want the biggest 100 numbers! Do we really need to sort ALL the numbers? No, we don't! So let's take a step back.
Let's consider the list of 10 numbers from the beginning of the post:
val nums = Vector(51, 2, 80, 70, 35, 52, 14, 12, 43, 92)
Let's find the largest single number in this list. To do this, we only need two things:
- the list of 10 integers
- 1 integer to store the largest value we've "seen" so far
Here's an example implementation:
def max(coll: Seq[Int]): Int = {
if (coll.isEmpty) {
throw new UnsupportedOperationException("Collection is empty")
}
var result = coll.head
for (e <- coll.tail) {
if (e > result) {
result = e
}
}
result
}
... or a slightly more idiomatic:
def max(coll: Seq[Int]): Option[Int] = coll match {
case Nil => None
case Seq(hd, tl @ _*) => Some(tl.foldLeft(hd)(_ max _))
}
There are two important things to note here:
- There is no sorting taking place.
- Whether a
for
loop or afoldLeft
, we only visit each element once.
The Road Ahead
We can use this same approach to find the largest m values in a sequence of n values.
Given:
- a sequence of values, data
- a quantity of values to extract from data, m
...here is the high-level logic:
def getTop(m, numbers):
collection = SomeCollection()
for value in numbers:
if size of collection < m:
add value to collection
else:
if value > collection.minimum:
remove collection.minimum
add value to collection
return collection
That seems simple enough. But what's that SomeCollection
?
In the next post, I'll unveil our mystery collection and we'll start implementing our solution.
Appendix: "Big O" Analysis
While there are exceptions, sorting algorithms generally range anywhere from O(n log n) to O(n2). It's this complexity --- both Space and Time --- that makes our GC go all wibbly-wobbly.
Consider our max()
method above:
- We visit every element in the list at least once, giving us a linear factor, O(n).
- At each node visited, we either (a) update
result
or (b) do nothing. These are both constant time operations, or O(C).
Factoring these together we get a cumulative complexity of O(Cn). Constants are generally ignored when determining complexity, so we're left with O(n) --- linear time.
What sort of complexity can we expect from getTop()
? We know that we
visit each element at least once, but what is the complexity of the
minimum
, add
, or remove
operations?
As we'll see, this depends on what we use for SomeCollection
, which we'll
discuss in Part 2.