simple_parallel 0.3: Revisiting k-NN

By Huon Wilson, 24 Oct 2015

I recently released version 0.3 of my simple_parallel crate, which builds on Aaron Turon’s crossbeam to resolve the stability and safety difficulties: the crate now works with Rust 1.3.0 stable, and offers safe data-parallel for loops and maps.

I still don’t recommend it for general use, but I think it’s a neat demonstration of what Rust’s type system allows, and hopefully inspiration for something awesome.

simple_parallel in 16 lines

Taster: safely setting the values in an array stored directly on the stack of a parent thread, in parallel, with 4 threads.

// (add `simple_parallel = "0.3"` to your Cargo.toml)
extern crate simple_parallel;

fn main() {
    let mut pool = simple_parallel::Pool::new(4);

    let mut stack_array = [0; 10];

    let large_complicated_thing = vec![4, 3, 2, 1];

    pool.for_(stack_array.iter_mut().enumerate(), |(i, elem)| {
        *elem = large_complicated_thing[i % 4]
    });

    println!("{:?}", &stack_array);
}

This is the same as writing for (i, elem) in stack_array.iter_mut().enumerate() { ... }, and the output is: [4, 3, 2, 1, 4, 3, 2, 1, 4, 3].
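
Spelled out, that sequential version is just:

fn main() {
    let mut stack_array = [0; 10];

    let large_complicated_thing = vec![4, 3, 2, 1];

    // same computation, one element at a time, on a single thread
    for (i, elem) in stack_array.iter_mut().enumerate() {
        *elem = large_complicated_thing[i % 4]
    }

    println!("{:?}", &stack_array); // [4, 3, 2, 1, 4, 3, 2, 1, 4, 3]
}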

It is a rather complicated way to initialise an array with those values, but it demonstrates some nice properties:

  • stack_array is allocated directly on the stack of the main thread; it doesn’t need to be pushed to the heap. Lifetimes and simple_parallel’s API statically ensure that the subthreads can’t hold references to it for too long (no GC necessary).
  • the large_complicated_thing Vec is safely shared between all threads, no copies necessary. Each thread gets a &Vec<_> reference, and they all point to the same large_complicated_thing value stored on the main thread’s stack. Again lifetimes ensure that the references won’t be dangling, but more interestingly the vector can be read without needing to copy or lock: zero-overhead immutable shared data.
  • the iter_mut method creates an iterator over &mut references to the elements of stack_array. The closure is called on each of them in parallel, and the references are disjoint/non-aliasing[0], meaning each call is manipulating a different section of memory. No atomics or locks are needed to mutate what the iterator feeds the closure.
  • the simple_parallel APIs all consume (nearly) arbitrary iterators: I can take the slice iterator and create a new (lazy) iterator via enumerate, pairing the output of the slice iterator with indices. (There are of course restrictions on the thread-safety properties of the iterator and its elements, necessary to get the points above safely.)

Some of this is driven by simple_parallel, some of it is crossbeam, but most of it is the power of Rust’s type system: it comes together just right to ensure[1] concurrency can be done fearlessly.

It’s not perfect, it’s not even great (there’s unnecessary overhead and it doesn’t offer many operations), but it has been useful for me (e.g. speeding up processing some pictures just required replacing for photo in photos with pool.for_(photos, |photo| ...)), and it serves as a neat little exploration into Rust’s type system. I’m confident we’ll see better libraries from better programmers that allow for some magical things.
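
As a sketch of what that picture-processing change looks like (Photo, process and photos here are hypothetical stand-ins for illustration, not the real code):

extern crate simple_parallel;

// hypothetical stand-ins for the real picture-processing code
type Photo = Vec<u8>;
fn process(photo: &mut Photo) {
    photo.reverse() // placeholder for the expensive per-photo work
}

fn main() {
    let mut photos: Vec<Photo> = vec![vec![1, 2, 3]; 8];
    let mut pool = simple_parallel::Pool::new(4);

    // before: for photo in photos.iter_mut() { process(photo) }
    // after: the same loop body, run across the pool's threads
    pool.for_(photos.iter_mut(), |photo| process(photo));
}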

k-NN

The very first post on this blog was Comparing k-NN in Rust, which ended with parallelising the task of validating a k-nearest neighbour (k-NN) classifier, using the safe-but-crude tools Rust-circa-0.11 offered at the time. We now live in a promised land, with language stability, Cargo & thousands of crates, and Send without 'static, so there are shiny new safe-and-less-crude tools!

The example above disguised the role of crossbeam, but the simple_parallel/crossbeam combo means it’s easy to process a stream in parallel, sharing data from parent threads with no overhead at all. The k-NN code loads two files into Vecs of 784-dimensional “pixels” via slurp_file—one of 5000 pixels of training data and one of 500 samples to test the classifier against—and then uses the classify function to predict a label for each of the validation samples based on the training ones, finally printing how many were predicted correctly.
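
The data types themselves aren’t the interesting part, but for concreteness, here’s a sketch of the shape involved (the names LabelPixel, label and pixels, and the use of i32, are assumptions for illustration; the real definitions live in the repository):

use std::fs::File;
use std::io::{BufRead, BufReader};

// one sample: a digit label plus its 784 pixel values
struct LabelPixel {
    label: i32,
    pixels: Vec<i32>,
}

// parse one comma-separated sample per line: the first field is the
// label, the rest are the pixels (details like header rows are glossed over)
fn slurp_file(path: &str) -> Vec<LabelPixel> {
    let reader = BufReader::new(File::open(path).unwrap());
    reader.lines()
        .map(|line| {
            let line = line.unwrap();
            let mut fields = line.split(',').map(|s| s.trim().parse().unwrap());
            LabelPixel {
                label: fields.next().unwrap(),
                pixels: fields.collect(),
            }
        })
        .collect()
}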

The full code is available at huonw/revisiting-knn, updated from 0.11.0 (which wasn’t too hard at all[2]), so I’m just going to focus on the interesting bit: main. The sequential version is short:

fn main() {
    let training_set = slurp_file("trainingsample.csv");
    let validation_sample = slurp_file("validationsample.csv");

    let num_correct = validation_sample.iter()
        .filter(|x| classify(&training_set, &x.pixels) == x.label)
        .count();

    println!("Percentage correct: {:.1}%",
             num_correct as f64 / validation_sample.len() as f64 * 100.0);
}

All the work happens in the filter call, and the classify function is the expensive part: each call does 5000 distance calculations between 784-dimensional vectors. perf’s instruction-level profiling tells me that nearly 95% of the time is spent in the loop[3] for that calculation (which actually gets inlined all the way into main itself).
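
For reference, a sketch of that hot path, reusing the LabelPixel shape from above (this is not the repository’s exact code: the fold is the non-separated form that footnote 3 discusses, and min_by_key is modern Rust rather than 1.3-era):

// squared Euclidean distance between two 784-dimensional vectors;
// this is the loop where nearly all the time goes
fn distance_sqr(x: &[i32], y: &[i32]) -> i32 {
    x.iter().zip(y.iter()).fold(0, |s, (&a, &b)| {
        let d = a - b;
        s + d * d
    })
}

// predict a label: find the closest training sample and copy its label
fn classify(training: &[LabelPixel], pixels: &[i32]) -> i32 {
    training.iter()
        .min_by_key(|p| distance_sqr(&p.pixels, pixels))
        .expect("non-empty training set")
        .label
}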

The parallel main isn’t much longer:

fn main() {
    // load files
    let training_set = slurp_file("trainingsample.csv");
    let validation_sample = slurp_file("validationsample.csv");

    // create a thread pool
    let mut pool = simple_parallel::Pool::new(4);

    crossbeam::scope(|scope| {
        let num_correct =
            pool.unordered_map(scope, &validation_sample, |x| {
                // is it classified right? (in parallel)
                classify(&training_set, &x.pixels) == x.label
            })
            .filter(|t| t.1)
            .count();

        println!("Percentage correct: {:.1}%",
                 num_correct as f64 / validation_sample.len() as f64 * 100.0);
    });
}

The unordered_map function is a bit more complicated than for_ above: it takes an iterator over As, and a function from A to any type B, and returns an iterator of (usize, B)s, in some random order[4]. For the example above, B is bool: whether the predicted classification was correct or not.
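
To make that (usize, B) shape concrete, here’s a toy use (collecting and sorting by the index recovers the original order, if it matters):

extern crate crossbeam;
extern crate simple_parallel;

fn main() {
    let mut pool = simple_parallel::Pool::new(4);
    let inputs = vec![10, 20, 30, 40];

    crossbeam::scope(|scope| {
        // each (i, b) pairs an element's original index with its result
        let mut results: Vec<(usize, i32)> =
            pool.unordered_map(scope, &inputs, |x| x * 2).collect();

        // the pairs arrive in some random order; sort by index to restore it
        results.sort();
        assert_eq!(results, [(0, 20), (1, 40), (2, 60), (3, 80)]);
    });
}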

The hardest part of parallelising that was working out where to put the crossbeam::scope: as I wrote it above, or as let num_correct = crossbeam::scope(...);? It was correct and ran in parallel the first time!

Interestingly, this code benefits greatly from being able to share stacks: firstly, each x in the loop is just a pointer into the validation_sample Vec: it can point right at the large chunk of memory (a 784-dimensional vector) owned by the main thread, without any copying. Secondly, and even better, all the parallel classify calls can share read-only access to the one huge training_set array (5000 of those high-dimensional vectors) without needing to copy it.

Numbers

Discussing parallelism means nothing without proving it is doing something useful: making things run faster. The sequential code took 1.86s, and the parallel version slashed that to 0.69s, 2.7× faster.

I’m running 1.3.0 stable, and compiled the code with cargo build --release --features sequential and cargo build --release for the two main functions above; the rest of the code stayed the same. I measured the time to run with perf stat -r 5 ....

I’m on a different/faster computer to the previous post, and the original Rust code no longer compiles. However, the OCaml code still does: with the same compiler, it takes approximately 8.5 seconds on this machine, about 1.6-1.7× faster than on the old one. The sequential Rust code is nearly 2× faster than the old version (meaning the compiler/standard library has likely improved), and the parallel version improves by even more… but this computer has more cores, so that comparison isn’t so interesting.

crossbeam::scope

The key trick that allowed me to get the APIs to be safe is this scope function. It was somewhat infamously realised that destructors cannot be relied upon for scope-based memory safety: it is possible to leave a scope without running a destructor, in safe code, e.g. by getting the value stuck in a reference cycle of Rcs. This means that if a library ever hands away an instance of something with a destructor, it has to be sure that things won’t go completely wrong if that destructor never executes.
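
For example, in completely safe code (a minimal demonstration; no unsafe or mem::forget required):

use std::cell::RefCell;
use std::rc::Rc;

struct Noisy;
impl Drop for Noisy {
    fn drop(&mut self) {
        println!("destructor ran");
    }
}

struct Node {
    _payload: Noisy,
    next: RefCell<Option<Rc<Node>>>,
}

fn main() {
    let node = Rc::new(Node {
        _payload: Noisy,
        next: RefCell::new(None),
    });
    // the cycle keeps the reference count above zero forever...
    *node.next.borrow_mut() = Some(node.clone());
} // ...so "destructor ran" is never printed, despite leaving the scope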

The alternative used in crossbeam was described in a Rust RFC (that never landed in Rust itself) written by Aaron, crossbeam’s author. The approach is to be a control freak: never let anyone else control your value, only hand out & references to it, so what runs when is in your power, and yours alone. This is what scope does.

Its signature is:

pub fn scope<'a, F, R>(f: F) -> R
    where F: FnOnce(&Scope<'a>) -> R

That is, scope takes one argument, which is a closure, and then passes a reference to a Scope to that closure. This Scope object allows for spawning threads/deferring functions, with the guarantee that the thread will exit/the function will run before scope returns.
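
For example, threads spawned on the Scope can borrow straight from the parent’s stack (a sketch against the crossbeam API of the time; later versions shuffle the details):

extern crate crossbeam;

fn main() {
    let data = vec![1, 2, 3];
    let data_ref = &data;

    crossbeam::scope(|scope| {
        for i in 0..3 {
            // each thread reads `data` directly off the parent's stack
            scope.spawn(move || println!("thread {}: {:?}", i, data_ref));
        }
    });
    // every spawned thread has been joined by this point, so the
    // borrow of `data` can't dangle
}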

Only the iterator adapters like unordered_map (rather than consumers like for_) in simple_parallel need to think about this externally: they act asynchronously, and so return an object that must not live too long and that needs to control the threads spawned to do the parallel processing, whereas the consumers can just block. By taking a Scope argument, these iterator adapters can defer functions that do the thread control, giving the nice iterator APIs with a fairly minimal usage overhead (wrapping the calls in a closure and passing an extra argument).

  0. The pointers are disjoint, so safety-wise/semantically everything is cool, but in the real world there are performance concerns, like false sharing: the values are all adjacent in memory and so probably share a cache line. In practice, this shouldn’t be a problem: either the time to compute each value will be significant, so the false-sharing hit is irrelevant, or one doesn’t need to/shouldn’t parallelise at the level of individual elements.

  1. Strictly speaking, “ensure” isn’t quite right: there’s no formal proof that Send/Sync/… all do have this guarantee, but there is work on formalisations of Rust that will either tackle this directly, or form important groundwork.

  2. This code is pretty simple, so it was barely affected by language/library changes: some import paths changed, some imports became necessary and others could be dropped, and all the .as_slice() calls disappeared.

  3. The code is actually slower than strictly necessary: if the fold is separated into .map(|(&a, &b)| a - b).fold(0, |s, d| s + d * d), it is autovectorised by LLVM to use SIMD instructions and runs twice as fast. However, I decided against doing this in the spirit of Rust 1.3 vs. Rust 0.11 comparisons: IIRC the old code didn’t get SIMD-ified.

  4. There’s also the plain old map function, which is careful to return the elements in the same order as they were in the original iterator. This is a bit more expensive.