# TopQueue in Scala - Part 1

July 01, 2018 #Algorithms #Data Structures #Deep Dive #Interviews #Scala #TopQueueI was once given the following interview question:

Design and implement a method to select the largest

values from a`m`

very largelist ofnumbers.`n`

I liked this question because solving it (the way we'll be doing) touches on a good cross section of software design in Scala:

- complexity ("Big O") analysis
- polymorphism (subclassing, traits, and bounds)
- testing!
- ...and a few Scala-flavored ideas:
- companion objects
- implicit values

In this post we'll examine the problem, calling out a few factors that make this question interesting, and propose a design for our solution.

In upcoming posts, we'll follow the design through implementation, functional testing, and add a few bells and whistles.

While the concepts we'll discuss are applicable throughout Computer Science, all of our implementations here will use the Scala programming language.

## n to the Great Wide Open

The "simplest" solution is to sort all * n* numbers and take the last

*elements. In Scala, finding the 5 largest numbers in a List of 20 values might look like this:*

`m`

```
scala> val rng = util.Random
rng: util.Random.type = scala.util.Random$@3cd9aa64
scala> val (n, m) = (10, 3)
n: Int = 10
m: Int = 3
scala> val randomNums = Vector.fill(n)(rng.nextInt(100))
randomNums: Seq[Int] = Vector(51, 2, 80, 70, 35, 52, 14, 12, 43, 92)
scala> val top_m = randomNums.sorted.takeRight(m)
top_m: Seq[Int] = List(70, 80, 92)
```

Well, that wasn't so hard, was it? Let's up the game:

```
scala> val (n, m) = (1000000, 100)
n: Int = 1000000
m: Int = 100
scala> val nums = Vector.fill(n)(rng.nextInt)
nums: scala.collection.immutable.Vector[Int] = Vector(230555202, 453704703, 1552626762, ...)
scala> val topNums = nums.sorted.takeRight(m)
res1: scala.collection.immutable.Vector[Int] = Vector(2147263479, 2147263783, 2147268655, ...)
```

No sweat! Scala handles that in a little over a second on my laptop.

### Attempt #1

How about 100,000,000? It might vary on your own machine, but I received one of the following exceptions every time:

```
scala> val topNums = nums.sorted.takeRight(m)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
```

### Attempt #2

Attempt #1 used the default heap constraints for the JVM. If I
increase the heap size to 2GB, I *still* run out of memory.

```
scala> val topNums = nums.sorted.takeRight(m)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:177)
```

### Attempt #3

With 3GB of heap, the JVM ground for over 10 minutes before I finally interrupted it.

```
scala> val topNums = nums.sorted.takeRight(m)
^C
```

### Attempt #4

If I go up to 4GB --- and wait just over a minute --- I *finally* get
the expected result!

```
scala> val topNums = nums.sorted.takeRight(m)
Vector(2147479698, 2147479714, 2147479772, 2147479777, ...)
// Elapsed time: 63116ms
```

## Brute Force vs the Big O

But that still took a while. And what went wrong the first couple times?

Well, sorting algorithms are expensive, and there are two major factors at play here.

### Space Complexity

Sorting requires additional memory to store the hundreds of thousands
of intermediate, temporary lists created during the sorting
process. It's these temporary objects that consumed all the available
heap memory. Even with a heap of 2GB, these objects couldn't fit in
memory at once, so the JVM finally collapsed. This is due to the
**Space Complexity** of our solution.

### Time Complexity

Increasing the memory to 4GB gave us enough space for our solution,
but 60+ seconds is a pretty noticeable chunk of time. This is the
**Time Complexity** of our solution, as we need to sort *every* element
of our 100,000,000 numbers --- a process that can take approximately
1.8 billion distinct comparisons, possibly many more.

Covering algorithmic complexity in depth is beyond the scope of this post, so if this is still voodoo to you, here are some links that should help you get you started:

- Algorithms, (site, Khan Academy)
- Mastering Algorithms with Perl (book, O'Reilly Media)
- Introduction to Algorithms (book, MIT Press)

## What was the question?

But we only want the biggest *100* numbers! Do we really need to sort
ALL the numbers? No, we don't! So let's take a step back.

Let's consider the list of 10 numbers from the beginning of the post:

```
val nums = Vector(51, 2, 80, 70, 35, 52, 14, 12, 43, 92)
```

Let's find the largest *single* number in this list. To do this, we
only need two things:

- the list of 10 integers
- 1 integer to store the largest value we've "seen"
*so far*

Here's an example implementation:

```
def max(coll: Seq[Int]): Int = {
if (coll.isEmpty) {
throw new UnsupportedOperationException("Collection is empty")
}
var result = coll.head
for (e <- coll.tail) {
if (e > result) {
result = e
}
}
result
}
```

... or a slightly more idiomatic:

```
def max(coll: Seq[Int]): Option[Int] = coll match {
case Nil => None
case Seq(hd, tl @ _*) => Some(tl.foldLeft(hd)(_ max _))
}
```

There are two important things to note here:

- There is no sorting taking place.
- Whether a
`for`

loop or a`foldLeft`

, we only visit each element*once*.

## The Road Ahead

We can use this same approach to find the largest *m* values in a
sequence of *n* values.

Given:

- a sequence of values,
**data** - a quantity of values to extract from
*data*,**m**

...here is the high-level logic:

```
def getTop(m, numbers):
collection = SomeCollection()
for value in numbers:
if size of collection < m:
add value to collection
else:
if value > collection.minimum:
remove collection.minimum
add value to collection
return collection
```

That seems simple enough. But what's that `SomeCollection`

?

In the next post, I'll unveil our mystery collection and we'll start implementing our solution.

## Appendix: "Big O" Analysis

While there are exceptions, sorting algorithms generally
range anywhere from * O(n log n)* to

*. It's this complexity --- both*

**O(n**^{2})*Space*and

*Time*--- that makes our GC go all

*wibbly-wobbly*.

Consider our `max()`

method above:

- We visit every element in the list at least once, giving us a linear
factor,
.**O(n)** - At each node visited, we either (a) update
`result`

or (b) do nothing. These are both*constant*time operations, or.**O(C)**

Factoring these together we get a cumulative complexity of
* O(Cn)*. Constants are generally ignored when determining

*complexity*, so we're left with

*---*

**O(n)***linear time*.

What sort of complexity can we expect from `getTop()`

? We know that we
visit each element at least once, but what is the complexity of the
`minimum`

, `add`

, or `remove`

operations?

As we'll see, this depends on what we use for `SomeCollection`

, which we'll
discuss in Part 2.