linguistics, computers, and puns

TopQueue in Scala - Part 1

July 01, 2018 #Algorithms #Data Structures #Deep Dive #Interviews #Scala #TopQueue

I was once given the following interview question:

Design and implement a method to select the largest m values from a very large list of n numbers.

I liked this question because solving it (the way we'll be doing) touches on a good cross section of software design in Scala:

In this post we'll examine the problem, calling out a few factors that make this question interesting, and propose a design for our solution.

In upcoming posts, we'll follow the design through implementation, functional testing, and add a few bells and whistles.

While the concepts we'll discuss are applicable throughout Computer Science, all of our implementations here will use the Scala programming language.

n to the Great Wide Open

The "simplest" solution is to sort all n numbers and take the last m elements. In Scala, finding the 5 largest numbers in a List of 20 values might look like this:

scala> val rng = util.Random
rng: util.Random.type = scala.util.Random$@3cd9aa64

scala> val (n, m) = (10, 3)
n: Int = 10
m: Int = 3

scala> val randomNums = Vector.fill(n)(rng.nextInt(100))
randomNums: Seq[Int] = Vector(51, 2, 80, 70, 35, 52, 14, 12, 43, 92)

scala> val top_m = randomNums.sorted.takeRight(m)
top_m: Seq[Int] = List(70, 80, 92)

Well, that wasn't so hard, was it? Let's up the game:

scala> val (n, m) = (1000000, 100)
n: Int = 1000000
m: Int = 100

scala> val nums = Vector.fill(n)(rng.nextInt)
nums: scala.collection.immutable.Vector[Int] = Vector(230555202, 453704703, 1552626762, ...)

scala> val topNums = nums.sorted.takeRight(m)
res1: scala.collection.immutable.Vector[Int] = Vector(2147263479, 2147263783, 2147268655, ...)

No sweat! Scala handles that in a little over a second on my laptop.

Attempt #1

How about 100,000,000? It might vary on your own machine, but I received one of the following exceptions every time:

scala> val topNums = nums.sorted.takeRight(m)
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:3664)

Attempt #2

Attempt #1 used the default heap constraints for the JVM. If I increase the heap size to 2GB, I still run out of memory.

scala> val topNums = nums.sorted.takeRight(m)
java.lang.OutOfMemoryError: GC overhead limit exceeded
  at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:177)

Attempt #3

With 3GB of heap, the JVM ground for over 10 minutes before I finally interrupted it.

scala> val topNums = nums.sorted.takeRight(m)
^C

Attempt #4

If I go up to 4GB --- and wait just over a minute --- I finally get the expected result!

scala> val topNums = nums.sorted.takeRight(m)
Vector(2147479698, 2147479714, 2147479772, 2147479777, ...)
// Elapsed time: 63116ms

Brute Force vs the Big O

But that still took a while. And what went wrong the first couple times?

Well, sorting algorithms are expensive, and there are two major factors at play here.

Space Complexity

Sorting requires additional memory to store the hundreds of thousands of intermediate, temporary lists created during the sorting process. It's these temporary objects that consumed all the available heap memory. Even with a heap of 2GB, these objects couldn't fit in memory at once, so the JVM finally collapsed. This is due to the Space Complexity of our solution.

Time Complexity

Increasing the memory to 4GB gave us enough space for our solution, but 60+ seconds is a pretty noticeable chunk of time. This is the Time Complexity of our solution, as we need to sort every element of our 100,000,000 numbers --- a process that can take approximately 1.8 billion distinct comparisons, possibly many more.

Covering algorithmic complexity in depth is beyond the scope of this post, so if this is still voodoo to you, here are some links that should help you get you started:

What was the question?

But we only want the biggest 100 numbers! Do we really need to sort ALL the numbers? No, we don't! So let's take a step back.

Let's consider the list of 10 numbers from the beginning of the post:

val nums = Vector(51, 2, 80, 70, 35, 52, 14, 12, 43, 92)

Let's find the largest single number in this list. To do this, we only need two things:

Here's an example implementation:

def max(coll: Seq[Int]): Int = {
  if (coll.isEmpty) {
    throw new UnsupportedOperationException("Collection is empty")
  }

  var result = coll.head
  for (e <- coll.tail) {
    if (e > result) {
      result = e
    }
  }

  result
}

... or a slightly more idiomatic:

def max(coll: Seq[Int]): Option[Int] = coll match {
  case Nil => None
  case Seq(hd, tl @ _*) => Some(tl.foldLeft(hd)(_ max _))
}

There are two important things to note here:

  1. There is no sorting taking place.
  2. Whether a for loop or a foldLeft, we only visit each element once.

The Road Ahead

We can use this same approach to find the largest m values in a sequence of n values.

Given:

...here is the high-level logic:

def getTop(m, numbers):
    collection = SomeCollection()
    for value in numbers:
        if size of collection < m:
            add value to collection
        else:
            if value > collection.minimum:
                remove collection.minimum
                add value to collection
    return collection

That seems simple enough. But what's that SomeCollection?

In the next post, I'll unveil our mystery collection and we'll start implementing our solution.


Appendix: "Big O" Analysis

While there are exceptions, sorting algorithms generally range anywhere from O(n log n) to O(n2). It's this complexity --- both Space and Time --- that makes our GC go all wibbly-wobbly.

Consider our max() method above:

Factoring these together we get a cumulative complexity of O(Cn). Constants are generally ignored when determining complexity, so we're left with O(n) --- linear time.

What sort of complexity can we expect from getTop()? We know that we visit each element at least once, but what is the complexity of the minimum, add, or remove operations?

As we'll see, this depends on what we use for SomeCollection, which we'll discuss in Part 2.