SIMD in Rust

Huon Wilson

Mozilla Research & University of Sydney


huonw.github.io/simd-aug15

SIMD?

Single Instruction Multiple Data: Do many number things at once.
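As a rough illustration of "many at once", the sketch below applies one addition to four values in a single call. It uses a plain array as a stand-in for a 4-lane vector; `Lanes` and `add4` are hypothetical names for this example, not types or functions from any SIMD library, and the compiler is not guaranteed to emit SIMD instructions for it.

```rust
// Plain-array stand-in for a 4-lane SIMD vector (hypothetical, for illustration only).
type Lanes = [f32; 4];

// One "vector add": the same operation is applied to every lane.
fn add4(a: Lanes, b: Lanes) -> Lanes {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```

Calling `add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])` gives `[11.0, 22.0, 33.0, 44.0]`: four sums from what is conceptually one operation.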

Acronyms, Acronyms Everywhere

Non-embedded devices have SIMD:

RFC #1199

PR #27169 (et al.)

github.com/huonw/simd

Publicity

Mandelbrot

fn mandelbrot(c_x: f32, c_y: f32,
              max_iter: u32) -> u32
{
    let mut x = c_x;
    let mut y = c_y;

    let mut count = 0;
    while count < max_iter {
        let xy = x * y;
        let xx = x * x;
        let yy = y * y;
        let sum = xx + yy;

        if sum > 4.0 { break }

        count += 1;

        x = xx - yy + c_x;
        y = xy + xy + c_y;
    }
    count
}
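A possible way to exercise the scalar kernel is to sample a row of points and mark the ones that never escape. The sketch below repeats the function from the slide so that it compiles on its own; `render_row`, the `[-2.0, 1.0]` sampling interval, and the `#`/`.` output are assumptions made for this example, not part of the talk.

```rust
// Scalar kernel, repeated from the slide so this block compiles on its own.
fn mandelbrot(c_x: f32, c_y: f32, max_iter: u32) -> u32 {
    let mut x = c_x;
    let mut y = c_y;

    let mut count = 0;
    while count < max_iter {
        let xy = x * y;
        let xx = x * x;
        let yy = y * y;
        let sum = xx + yy;

        // |z|^2 > 4 means the point has escaped.
        if sum > 4.0 { break }

        count += 1;

        x = xx - yy + c_x;
        y = xy + xy + c_y;
    }
    count
}

// Hypothetical helper: sample `width` points (width >= 2 assumed) along the
// horizontal line at height `c_y`, marking points that reach `max_iter`.
fn render_row(c_y: f32, width: usize, max_iter: u32) -> String {
    (0..width)
        .map(|i| {
            // Map column i onto the interval [-2.0, 1.0] of the real axis.
            let c_x = -2.0 + 3.0 * (i as f32) / (width as f32 - 1.0);
            if mandelbrot(c_x, c_y, max_iter) == max_iter { '#' } else { '.' }
        })
        .collect()
}
```

The kernel handles one point per call, so a full image costs one call per pixel. That per-pixel loop is the work the ×4 version batches.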

×4

fn mandelbrot(c_x: f32x4, c_y: f32x4,
              max_iter: u32) -> u32x4
{
    let mut x = c_x;
    let mut y = c_y;

    let mut count = u32x4::splat(0);
    for _ in 0..max_iter as usize {
        let xy = x * y;
        let xx = x * x;
        let yy = y * y;
        let sum = xx + yy;
        let mask = sum.lt(f32x4::splat(4.0));
        if !mask.any() { break }

        count = count + mask.to_i().select(u32x4::splat(1),
                                           u32x4::splat(0));
        x = xx - yy + c_x;
        y = xy + xy + c_y;
    }
    count
}
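The mask-and-select logic can be hard to see at a glance, so here is a sketch that spells it out lane by lane on stable Rust. Plain arrays stand in for `f32x4` and `u32x4`; `F32x4`, `U32x4` and `mandelbrot_x4` are names chosen for this example, and nothing here promises actual SIMD instructions. Like the slide, it uses a strict `<` for the mask, so a lane whose sum is exactly 4.0 stops counting where the scalar version (`sum > 4.0`) would continue.

```rust
// Plain-array stand-ins for the 4-lane vector types (hypothetical, for illustration).
type F32x4 = [f32; 4];
type U32x4 = [u32; 4];

fn mandelbrot_x4(c_x: F32x4, c_y: F32x4, max_iter: u32) -> U32x4 {
    let mut x = c_x;
    let mut y = c_y;
    let mut count = [0u32; 4];

    for _ in 0..max_iter {
        let mut xy = [0.0f32; 4];
        let mut xx = [0.0f32; 4];
        let mut yy = [0.0f32; 4];
        let mut mask = [false; 4];
        for i in 0..4 {
            xy[i] = x[i] * y[i];
            xx[i] = x[i] * x[i];
            yy[i] = y[i] * y[i];
            // Lane-wise comparison: true while this lane has not escaped.
            mask[i] = xx[i] + yy[i] < 4.0;
        }

        // Counterpart of `!mask.any()`: stop once every lane has escaped.
        if !mask.iter().any(|&m| m) {
            break;
        }

        for i in 0..4 {
            // Counterpart of `select(splat(1), splat(0))`: only live lanes count up.
            count[i] += if mask[i] { 1 } else { 0 };
            // Escaped lanes keep being updated; their results are simply ignored.
            x[i] = xx[i] - yy[i] + c_x[i];
            y[i] = xy[i] + xy[i] + c_y[i];
        }
    }
    count
}
```

There is no per-lane `break`: the loop runs until the slowest lane finishes, and the mask keeps the finished lanes from counting further.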

×4: zero overhead

for _ in 0..max_iter as usize {
    let xy = x * y;
    let xx = x * x;
    let yy = y * y;
    let sum = xx + yy;
    let mask = sum.lt(f32x4::splat(4.0));
    if !mask.any() { break }
    count = count + mask.to_i().select(u32x4::splat(1),
                                       u32x4::splat(0));
    x = xx - yy + c_x;
    y = xy + xy + c_y;
}
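.LBB1_1:
    fmul    v7.4s, v5.4s, v5.4s       // xx = x * x
    fmul    v16.4s, v6.4s, v6.4s      // yy = y * y
    fadd    v17.4s, v16.4s, v7.4s     // sum = xx + yy
    fcmgt   v17.4s, v3.4s, v17.4s     // mask = sum < 4.0 (v3 presumably holds splat(4.0))
    umaxv   s18, v17.4s               // mask.any(): maximum across lanes
    fmov    w9, s18
    cbz     w9, .LBB1_3               // break if no lane is set
    fmul    v6.4s, v6.4s, v5.4s       // xy = x * y
    add     x8, x8, #1                // loop counter
    and     v5.16b, v17.16b, v4.16b   // select: mask & v4 (presumably splat(1))
    fsub    v7.4s, v7.4s, v16.4s      // xx - yy
    add     v2.4s, v5.4s, v2.4s       // count += selected increment
    fadd    v5.4s, v7.4s, v0.4s       // x = xx - yy + c_x
    fadd    v6.4s, v6.4s, v6.4s       // xy + xy
    fadd    v6.4s, v6.4s, v1.4s       // y = xy + xy + c_y
    cmp     x8, #100                  // compare against max_iter (100 in this build)
    b.lo    .LBB1_1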

Benchmarks...

2.4× faster, on average.

Benchmarks... everywhere

2.1× faster, on average.

Benchmarks... everywhere²

2.4× faster, on average.

Future