SIMD in Rust

Huon Wilson

huonw.github.io/simd-sep15

SIMD?

Single Instruction, Multiple Data: one instruction operates on several numbers at once.
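To make "several numbers at once" concrete, here is a minimal sketch in plain Rust that models a four-lane vector as an array. The name `add_lanes` is hypothetical and not part of any library. The loop only models the semantics; real SIMD hardware performs the four additions with a single instruction.

```rust
/// Hypothetical helper: element-wise addition of two 4-lane "vectors".
/// Each index i is one lane; lanes never interact with each other.
fn add_lanes(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}
```

Calling `add_lanes([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])` adds each pair of lanes independently.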

Why?

Multimedia
3D graphics
Cryptography
Numerical/scientific processing
...

Acronyms, Acronyms Everywhere

Non-embedded devices have SIMD:

x86/x86-64: SSE, AVX, AVX-512 (etc.)
ARM/AArch64: NEON, (VFP)
PowerPC: AltiVec
MIPS: MSA, (MDMX, MIPS-3D)
(SPARC: VIS)

`github.com/huonw/simd`

Mandelbrot

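// Counts iterations of z = z² + c (starting from z = c) until
// |z|² exceeds 4 or max_iter is reached.
fn mandelbrot(c_x: f32, c_y: f32,
              max_iter: u32) -> u32
{
    let mut x = c_x;
    let mut y = c_y;

    let mut count = 0;
    while count < max_iter {
        let xy = x * y;
        let xx = x * x;
        let yy = y * y;
        let sum = xx + yy; // |z|²

        if sum > 4.0 { break }

        count += 1;

        // z = z² + c, split into real and imaginary parts
        x = xx - yy + c_x;
        y = xy + xy + c_y;
    }
    count
}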

×4

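extern crate simd;
use simd::{f32x4, u32x4};

// Same algorithm, four points at a time: each lane holds one point.
fn mandelbrot(c_x: f32x4, c_y: f32x4,
              max_iter: u32) -> u32x4
{
    let mut x = c_x;
    let mut y = c_y;

    let mut count = u32x4::splat(0);
    for _ in 0..max_iter as usize {
        let xy = x * y;
        let xx = x * x;
        let yy = y * y;
        let sum = xx + yy;
        // Per-lane comparison: true where the point has not escaped yet.
        let mask = sum.lt(f32x4::splat(4.0));
        // Stop only once every lane has escaped.
        if !mask.any() { break }

        // Add 1 to the lanes that are still bounded, 0 to the rest.
        count = count + mask.to_i().select(u32x4::splat(1),
                                           u32x4::splat(0));
        x = xx - yy + c_x;
        y = xy + xy + c_y;
    }
    count
}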
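The masked update above can be checked without the `simd` crate. The following is a sketch that stands in plain arrays for `f32x4` and `u32x4`; the names `mandelbrot_scalar` and `mandelbrot_x4` are hypothetical, chosen here only to tell the two versions apart. It is meant to show that each lane stops counting on its own while the loop keeps running for the lanes that remain bounded.

```rust
/// Scalar reference, following the one-point version above.
fn mandelbrot_scalar(c_x: f32, c_y: f32, max_iter: u32) -> u32 {
    let (mut x, mut y) = (c_x, c_y);
    let mut count = 0;
    while count < max_iter {
        let (xy, xx, yy) = (x * y, x * x, y * y);
        if xx + yy > 4.0 { break }
        count += 1;
        x = xx - yy + c_x;
        y = xy + xy + c_y;
    }
    count
}

/// Array-based stand-in for the four-lane version (assumption: arrays
/// model the lanes; no SIMD instructions are requested explicitly).
fn mandelbrot_x4(c_x: [f32; 4], c_y: [f32; 4], max_iter: u32) -> [u32; 4] {
    let (mut x, mut y) = (c_x, c_y);
    let mut count = [0u32; 4];
    for _ in 0..max_iter {
        let mut xy = [0.0f32; 4];
        let mut xx = [0.0f32; 4];
        let mut yy = [0.0f32; 4];
        let mut mask = [false; 4];
        for i in 0..4 {
            xy[i] = x[i] * y[i];
            xx[i] = x[i] * x[i];
            yy[i] = y[i] * y[i];
            // Models sum.lt(splat(4.0)): true while the lane is bounded.
            mask[i] = xx[i] + yy[i] < 4.0;
        }
        // Models !mask.any(): leave only when every lane has escaped.
        if !mask.iter().any(|&m| m) { break }
        for i in 0..4 {
            // Models select(splat(1), splat(0)): add 1 or 0 per lane.
            count[i] += mask[i] as u32;
            x[i] = xx[i] - yy[i] + c_x[i];
            y[i] = xy[i] + xy[i] + c_y[i];
        }
    }
    count
}
```

Escaped lanes keep being iterated and simply contribute 0 to their count. One difference to be aware of: the scalar version tests `sum > 4.0` while the vector version tests `sum < 4.0`, so the two can disagree for points where `sum` is exactly 4.0.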

×4: zero overhead

for _ in 0..max_iter as usize {
    let xy = x * y;
    let xx = x * x;
    let yy = y * y;
    let sum = xx + yy;
    let mask = sum.lt(f32x4::splat(4.0));
    if !mask.any() { break }

    count = count + mask.to_i().select(u32x4::splat(1),
                                       u32x4::splat(0));
    x = xx - yy + c_x;
    y = xy + xy + c_y;
}

.LBB1_1:
    fmul    v7.4s, v5.4s, v5.4s
    fmul    v16.4s, v6.4s, v6.4s
    fadd    v17.4s, v16.4s, v7.4s
    fcmgt   v17.4s, v3.4s, v17.4s
    umaxv   s18, v17.4s
    fmov    w9, s18
    cbz     w9, .LBB1_3
    fmul    v6.4s, v6.4s, v5.4s
    add     x8, x8, #1
    and     v5.16b, v17.16b, v4.16b
    fsub    v7.4s, v7.4s, v16.4s
    add     v2.4s, v5.4s, v2.4s
    fadd    v5.4s, v7.4s, v0.4s
    fadd    v6.4s, v6.4s, v6.4s
    fadd    v6.4s, v6.4s, v1.4s
    cmp     x8, #100
    b.lo    .LBB1_1
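In the AArch64 listing above, the `select` call appears to have been lowered to a single `and` followed by an `add`: `fcmgt` yields all ones or all zeros in each lane, and `v4` presumably holds 1 in every lane. A sketch of that branch-free update in plain Rust, with the hypothetical names `lane_mask` and `masked_increment`:

```rust
/// Models a vector comparison result for one lane: all ones when
/// `limit > sum`, all zeros otherwise (including when `sum` is NaN).
fn lane_mask(sum: f32, limit: f32) -> u32 {
    if limit > sum { u32::MAX } else { 0 }
}

/// Branch-free count update: `mask & 1` is 1 for bounded lanes and 0
/// for escaped lanes, so one AND plus one ADD replaces the select.
fn masked_increment(count: [u32; 4], sum: [f32; 4]) -> [u32; 4] {
    let mut out = count;
    for i in 0..4 {
        out[i] = count[i] + (lane_mask(sum[i], 4.0) & 1);
    }
    out
}
```

No lane ever takes a branch; the only branch left in the loop is the "has every lane escaped?" test, which the listing implements with `umaxv` and `cbz`.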

Benchmarks

2.4× faster, on average.

Benchmarks... everywhere

2.1× faster, on average.

Benchmarks... everywhere²

2.4× faster, on average.

Platform-Specific

f64 vectors on x86 (with SSE2) and on AArch64:

#[cfg(target_feature = "sse2")]
use simd::x86::sse2::*;
#[cfg(target_arch = "aarch64")]
use simd::aarch64::neon::*;

// ...
    let mut dx = [f64x2::splat(0.0); 3];
// ...
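The same `cfg` pattern can be sketched on stable Rust with `std::arch` instead of the `simd` crate (a swapped-in technique, not what the slide uses). The function name `add_f64x2` is hypothetical. Exactly one of the three definitions is compiled, depending on the target, and callers see a single signature.

```rust
/// x86-64 with SSE2: two f64 lanes in one 128-bit register.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn add_f64x2(a: [f64; 2], b: [f64; 2]) -> [f64; 2] {
    use std::arch::x86_64::{_mm_add_pd, _mm_loadu_pd, _mm_storeu_pd};
    let mut out = [0.0f64; 2];
    // SAFETY: SSE2 is enabled at compile time (see the cfg above) and
    // every pointer refers to two valid, initialized f64 values.
    unsafe {
        let sum = _mm_add_pd(_mm_loadu_pd(a.as_ptr()), _mm_loadu_pd(b.as_ptr()));
        _mm_storeu_pd(out.as_mut_ptr(), sum);
    }
    out
}

/// AArch64 with NEON: the same operation with NEON intrinsics.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn add_f64x2(a: [f64; 2], b: [f64; 2]) -> [f64; 2] {
    use std::arch::aarch64::{vaddq_f64, vld1q_f64, vst1q_f64};
    let mut out = [0.0f64; 2];
    // SAFETY: NEON is enabled at compile time (see the cfg above) and
    // every pointer refers to two valid, initialized f64 values.
    unsafe {
        let sum = vaddq_f64(vld1q_f64(a.as_ptr()), vld1q_f64(b.as_ptr()));
        vst1q_f64(out.as_mut_ptr(), sum);
    }
    out
}

/// Any other target: plain scalar code with the same signature.
#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "sse2"),
    all(target_arch = "aarch64", target_feature = "neon")
)))]
fn add_f64x2(a: [f64; 2], b: [f64; 2]) -> [f64; 2] {
    [a[0] + b[0], a[1] + b[1]]
}
```

The scalar fallback is an addition for illustration; the slide itself only imports the platform modules and would not compile on other targets.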

Platform-Specific

SSSE3 pshufb instruction:

use simd::x86::ssse3::*;

let x: u8x16 = ...;
let y: u8x16 = ...;

x.shuffle_bytes(y)
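For readers unfamiliar with `pshufb`, here is a scalar sketch of what the 128-bit form of the instruction computes, written in plain Rust. The name `shuffle_bytes_scalar` is hypothetical; it models the instruction's semantics and is not the `simd` crate's implementation.

```rust
/// Scalar model of 128-bit pshufb: each output byte is selected from
/// `x` by the low four bits of the matching control byte in `y`, or
/// set to zero when that control byte has its high bit set.
fn shuffle_bytes_scalar(x: [u8; 16], y: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = if y[i] & 0x80 != 0 {
            0
        } else {
            x[(y[i] & 0x0f) as usize]
        };
    }
    out
}
```

With control bytes 15, 14, ..., 0 this reverses the vector, which is why the instruction is popular for byte swapping and table lookups.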

Future

Stability
More platforms
More comprehensive support in simd
Dynamic SIMD feature dispatch (choose between foo_avx, foo_sse41, foo_sse2, etc. at run time)
More libraries
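One possible shape for the dynamic dispatch item above, sketched with facilities from today's standard library (`is_x86_feature_detected!` and `#[target_feature]`) rather than anything the `simd` crate provides. The `foo_*` names follow the list; `foo_body`, `foo_fallback`, and the returned label are assumptions made for illustration.

```rust
/// Shared loop body; inlined into each feature-specific wrapper so the
/// compiler can generate code using that wrapper's instruction set.
#[inline(always)]
fn foo_body(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x *= 2.0;
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn foo_avx(xs: &mut [f32]) { foo_body(xs) }

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.1")]
unsafe fn foo_sse41(xs: &mut [f32]) { foo_body(xs) }

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse2")]
unsafe fn foo_sse2(xs: &mut [f32]) { foo_body(xs) }

/// Portable version for CPUs or targets without those features.
fn foo_fallback(xs: &mut [f32]) { foo_body(xs) }

/// Doubles every element, picking the widest supported variant at run
/// time. Returns a label naming the variant that ran.
pub fn foo(xs: &mut [f32]) -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // SAFETY (all three calls): the feature was just detected on
        // the running CPU, so its instructions may be executed.
        if is_x86_feature_detected!("avx") {
            unsafe { foo_avx(xs) };
            return "avx";
        }
        if is_x86_feature_detected!("sse4.1") {
            unsafe { foo_sse41(xs) };
            return "sse4.1";
        }
        if is_x86_feature_detected!("sse2") {
            unsafe { foo_sse2(xs) };
            return "sse2";
        }
    }
    foo_fallback(xs);
    "fallback"
}
```

The detection runs on every call in this sketch; a real library would more likely cache the choice, for example in a function pointer selected once.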