
Async hazard: mmap is secretly blocking IO

By Huon Wilson, 21 Aug 2024

Memory mapping a file for reading sounds nice: it turns inconvenient read calls and manual buffering into simple indexing of memory… but it does blocking IO under the hood, turning a &[u8] byte array into an async hazard and making “concurrent” async code actually run sequentially!

Affected code likely runs slower, underutilises machine resources, and has undesirable latency spikes.

I’ve done some experiments in Rust that show exactly what this means, but I think this applies to any system that doesn’t have special handling of memory-mapped IO (including Python, and manual non-blocking IO in C).

I want numbers

I set up some benchmarks that scan through 8 files, each 256 MiB, summing the value of every 512th byte. This is an uninteresting task, but it is designed to maximise IO usage and minimise everything else. This is thus a worst case; the impact on real code is unlikely to be quite this severe!
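
The per-file work boils down to something like the sketch below: map the file and sum every 512th byte. This is illustrative rather than the exact benchmark code; the sum_strided name and the path argument are placeholders.

use memmap2::Mmap;

// Sketch of the per-file task: sum the value of every 512th byte.
fn sum_strided(path: &str) -> std::io::Result<u64> {
    let file = std::fs::File::open(path)?;
    // Safety: the file must not be modified or truncated while mapped.
    let mmapped = unsafe { Mmap::map(&file)? };
    // Touch one byte per 512: maximises IO, minimises everything else.
    Ok(mmapped.iter().step_by(512).map(|&b| b as u64).sum::<u64>())
}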

Results of 100 iterations of the benchmarks on macOS on an M1 MacBook Pro, one plotted point per iteration, with a cold file system cache.

There are 6 configurations, using all combinations of how files are read and what concurrency is used.[1]

For reading files, there’s two dimensions:

  1. conventional IO: using explicit read calls.
  2. memory-mapped IO: using the memmap2 library.

For concurrency, there’s three:

  1. async/await: using the Tokio library[2] in single-threaded mode[3] (sketched below).
  2. synchronously, with 8 threads: spawning a thread per file.
  3. synchronously, with 1 thread: a baseline using a single thread to read sequentially.
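
Roughly, the async/await configuration is wired up like the sketch below: per the footnotes, the files are iterated concurrently with join_all on a current-thread runtime. Here scan_file stands in for the real per-file task (like the one sketched earlier) and the paths are placeholders.

use futures::future::join_all;

// Stand-in for the real per-file task (conventional or memory-mapped).
async fn scan_file(path: &str) -> u64 {
    /* read the file and sum every 512th byte */
    0
}

fn main() {
    // Single-threaded ("current thread") Tokio runtime: one OS thread
    // runs the executor and every task.
    let runtime = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let paths = ["0.bin", "1.bin" /* … 8 files of 256 MiB each … */];
    // One task per file, all joined concurrently.
    let sums = runtime.block_on(join_all(paths.iter().map(|&p| scan_file(p))));
    println!("{sums:?}");
}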

The results[4] are clear: using memory-mapped IO with async/await seems to be using no concurrency at all; the clusters of dots are near-identical for the first two rows! An explicitly-sequential single thread takes 2.5 to 3 seconds to scan all 2GiB with memory-mapped IO, and the “concurrent” async/await version looks identical. By comparison, using memory-mapped IO on 8 full threads is far faster, around 0.75 to 0.8 seconds.

Meanwhile, using conventional IO behaves more like we’d hope: using either form of concurrency is noticeably faster than running sequentially. Using either async/await or operating system threads results in reading all files in less than 0.65 seconds[5], while the sequential single-threaded version takes only a bit longer (0.7s) but with little overlap in the distributions.

What’s going on?

Operating systems generally distinguish between bytes in RAM and files stored on disk. Reading a file is a dedicated operation that slurps bytes from disk into an array in memory. But there are more shades of grey: the mmap Unix system call (or equivalent on other platforms) blurs this disk vs. memory distinction by setting up “memory-mapped IO”.

The mmap operation allocates a range of (virtual) memory to a chosen file, making that memory “contain” the file: reading[6] the first byte in memory gives the value of the first byte of the file, as does the second, and so on. The file essentially acts as a normal &[u8] array that can be indexed and sliced.

let file = std::fs::File::open("data.bin")?;
// Safety: the file must not be modified or truncated while mapped.
let mmapped = unsafe { memmap2::Mmap::map(&file)? };

let first_byte: u8 = mmapped[0];
println!("File starts {first_byte}");

let central_bytes: &[u8] = &mmapped[1200..1300];
println!("File contains {central_bytes:?}");

This sounds a lot like what one would get by manually allocating a memory array as a buffer, seeking within the file, reading the file to fill that buffer, and then accessing the buffer… just without all the book-keeping. What’s not to like?
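
For comparison, that manual version looks something like the sketch below, reading the same data.bin and byte range as the slice above; the book-keeping is exactly what memory mapping hides.

use std::io::{Read, Seek, SeekFrom};

let mut file = std::fs::File::open("data.bin")?;

// Allocate a buffer, seek to the region of interest, and fill the buffer.
let mut central_bytes = [0u8; 100];
file.seek(SeekFrom::Start(1200))?;
file.read_exact(&mut central_bytes)?;
println!("File contains {central_bytes:?}");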

Under the hood

To make this work, the operating system has to do magic.

One possible implementation might be to literally have the operating system allocate a chunk of physical memory and load the file into it, byte by byte, right when mmap is called… but this is slow, and defeats half the magic of memory-mapped IO: manipulating files without having to pull them into memory. Maybe the machine has 4GiB of RAM installed, but we want to manipulate a 16GiB file.

Instead, operating systems will use the power of virtual memory: this allows some parts of the file to be in physical memory, while other parts are happily left on disk, and different parts can be paged in and out of memory as required. Sections of file are only loaded into memory as they are accessed.

let mmapped: Mmap = /* map some file into memory */;

let index = 12345;
let byte_of_interest = mmapped[index];
println!("byte #{index} has value {byte_of_interest}");

That indexing accesses the 12 345th byte of the file. On the first access, there’ll be a page fault, and the operating system takes over: it loads the relevant data from the file on disk into actual physical memory, meaning the data is now available directly in normal RAM. This includes loading a small surrounding region—a page—which means future accesses to nearby addresses are fast.

While loading, the thread and its code will be blocked: the thread will be descheduled and won’t make further progress, because the next lines need the byte value to do their work.

Once loaded, the page of the file will be cached in memory for a while, and during that time any further accesses to addresses in the page will be fast, straight from memory. Depending on system memory usage and other factors, the operating system might then decide to evict the page from memory, freeing space for something else. Accessing it after eviction will require reloading from disk, with the same blocking behaviour and the cycle repeats.

In our application code, all this magic is packaged up into the simple indexing: byte_of_interest = mmapped[index]. The separate “already in memory” (fast) and “load from disk” (slow) cases are invisible[7] in the code. In either case, we start with our memory address and end up with a byte value.

Losing the thread

This magic is why our concurrency doesn’t work. The async/await construct is co-operative scheduling, with an “executor” or “event loop” that manages many tasks, swapping between them as required… but this swapping is only possible at explicit await points. If a task runs for a long time or is “blocked” without an await, the executor won’t be able to preempt it to run another task. Others have written more about this blocking pitfall, far more knowledgeably than I.
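
To see the pitfall in isolation, here’s a minimal sketch with no mmap involved: on a single-threaded runtime, a task that blocks the thread (std::thread::sleep standing in for any blocking call; the durations are arbitrary) stops the other task from running, because there’s no await at which the executor could swap.

#[tokio::main(flavor = "current_thread")]
async fn main() {
    tokio::join!(
        async {
            // No await here: this blocks the executor's only thread for 2s.
            std::thread::sleep(std::time::Duration::from_secs(2));
        },
        async {
            // Wants to run after 10ms, but won't be polled until the
            // blocking task above has finished.
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            println!("finally running");
        },
    );
}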

Remember that the code just looks like let byte_of_interest = mmapped[index];? There’s no await, so the async/await executor cannot swap to another task. Instead, the operation blocks and deschedules the whole thread… But the executor is just normal code using that thread too, so the very thing that coordinates the concurrency is descheduled and can’t do any other work.

This is different to proper async IO; the individual task will still need to wait for the data to be read, but awaits mean the executor can swap to another task while that happens: concurrency!

use tokio::io::AsyncReadExt; // brings `read` into scope

let mut file: tokio::fs::File = /* some file */;
let mut buffer: [u8; N] = [0; N];
file.read(&mut buffer).await?; // crucial await

At a distance

The most subtle part of this hazard to me is that the memory-mapped file looks like a normal buffer in memory. Code is likely to be implicitly assuming that indexing a &[u8] is fast, and does not secretly read a file.

For instance, memmap2::Mmap derefs to [u8], meaning we can take fn f(b: &[u8]) { ... }, and call it like f(&mmapped). That function just sees b: &[u8], and won’t have any idea that indexing b might do blocking IO. It won’t know that it needs to take care with any co-operative concurrency it is performing.
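
Concretely, the hazard looks something like this sketch, where checksum is a hypothetical helper that only ever sees a &[u8]:

// A hypothetical helper: as far as it knows, this is pure in-memory work.
fn checksum(b: &[u8]) -> u64 {
    b.iter().map(|&x| x as u64).sum()
}

async fn task(mmapped: memmap2::Mmap) -> u64 {
    // Deref coercion turns &Mmap into &[u8]. Deep inside `checksum`,
    // each index may fault a page in from disk and block the whole
    // executor thread, with no await anywhere in sight.
    checksum(&mmapped)
}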

Rust is just the demonstration here; this applies universally: Python’s mmap explicitly says “behave like … bytearray”, while the C/POSIX mmap function returns void * (same as malloc). The whole point of memory-mapping is pretending a file is a byte array, and that’s what bites us here!

Caching: Cha-ching!

We’ve seen above that the problem is when data has to be loaded from disk into memory. What happens if the data is in memory already? We might expect memory-mapped IO to run much faster in that case, since it’s theoretically just accessing normal memory, straight from RAM.

This version of the benchmark times how long it takes to run the same task, immediately after doing the cold version we saw above, without purging caches.

100 iterations of the same benchmark as above, but with a warm file system cache.

Our theory was right; it is much faster: with a cold cache, we were seeing memory-mapped IO on a single thread take 2.5s to read 2GiB. Now it takes 0.12s, 21× quicker, with async/await or without[4].

There are some other notable observations about running with a warm cache:

  • Memory-mapped IO is faster than conventional IO. I imagine this is because there’s less overhead: for conventional IO, there’s an extra memcpy of data from the operating system page cache into the buffer passed into the read call. Using Tokio and async/await has even more overhead, likely because of the synchronisation with the thread-pool that has to be used for asynchronous file IO.
  • Asynchronous and single-threaded sequential memory-mapped IO still take the same time as each other, same as with a cold cache… just both are now fast.
  • If you want speed, using multiple threads is handy: memory-mapped IO is achieving “file” IO speeds of ~41GiB/s, by reading 2GiB in 49ms.
  • This is likely system-dependent: I was running on arm64 macOS, and other systems may see different results.

More questions

I’ve done some experiments here, putting the “science” in “computer science”, and of course there are more questions: I don’t know the answers!

  1. How does this manifest across other platforms, beyond arm64 macOS? (Hypothesis: similar on all platforms.)
  2. Does the type of disk backing the files influence behaviour? (Hypothesis: yes, I’d expect running this on a high-latency disk like a spinning-rust HDD to show even worse blocking behaviour than the SSD used here, as each thread will be blocked for longer.)
  3. Does prefetching via CPU instructions help mitigate the problems? (Hypothesis: not sure. I have no idea if running a prefetch instruction on a memory-mapped address will cause the page to be loaded in the background. One may have to prefetch many, many loop iterations in advance, since this is reading all the way from disk, not just from RAM to CPU cache, for which prefetching is often used.)
  4. How do other mmap/madvise options influence this (for instance, MADV_SEQUENTIAL, MADV_WILLNEED, MADV_POPULATE, MADV_POPULATE_READ, mlock)? (Hypothesis: these options will make it more likely that data is pre-cached, and thus that accesses hit the fast path more often, but without a guarantee. A sketch of trying some of these hints follows this list.)
  5. What about readahead? (Hypothesis: making the fast path more likely but not guaranteed, similar to the previous point.)
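
On question 4: memmap2 exposes madvise through Mmap::advise on Unix targets, so trying these hints looks something like the sketch below. Whether they actually avoid the blocking is exactly the open question: they’re hints, not guarantees.

use memmap2::{Advice, Mmap};

let file = std::fs::File::open("data.bin")?;
let mmapped = unsafe { Mmap::map(&file)? };

// Hints only: the kernel may pre-fault pages, but accesses can still block.
mmapped.advise(Advice::Sequential)?; // expect sequential access
mmapped.advise(Advice::WillNeed)?;   // please start paging the data in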

Conclusion

Accessing files via memory-mapped IO makes for a very convenient API: just read a byte with normal indexing. However, it’s error-prone when used with co-operatively scheduled concurrency, like async/await in Rust or Python, or even “manual” non-blocking IO in C:

  • If the data isn’t available in memory yet, the operating system will have to block the thread while reading from disk, and there are no awaits to allow the async executor to run another task.
  • Thus, heavy use of mmap can potentially be as slow as sequential code!
  • It’s a hazard at a distance: a memory-mapped file can be coerced to look like a normal byte array (&[u8], bytearray/list[int], or void */char *) and passed deep into the async code before being indexed.
  • If data is cached by the operating system, mmap has less overhead than conventional IO and can be a little faster (on my system).
  1. I have a script that repeatedly measures the time to scan through all 2GiB, in each configuration. Before each measurement, it drops any file system caches with purge to run with a cold cache. 

  2. The Tokio benchmarks iterate over each file concurrently with a join_all call. This uses the tokio::fs::File type for “conventional IO”. Note: this uses a thread-pool in the background, because it seems like there’s limited support for truly asynchronous file IO, but the key is the API exposed, using async/await.

  3. The benchmarks use the single threaded runtime to make the problem as obvious and clear as possible: using a multi-thread runtime may disguise the consequences of mmap’s blocking IO by smearing it across multiple threads. 

  4. Here’s a table with the full results, including both cold and warm file system caches, taking the minimum value of each benchmark (that is, the most favourable observation):

    Concurrency       IO            Cold (s)  Warm (s)
    Async             Mmap          2.5       0.12
    Async             Conventional  0.62      0.22
    Sync, 8 threads   Mmap          0.75      0.049
    Sync, 8 threads   Conventional  0.62      0.063
    Sync, 1 thread    Mmap          2.5       0.12
    Sync, 1 thread    Conventional  0.67      0.14


  5. I think this is maxing out the machine’s SSD bandwidth, at around 3300 MiB/s, hence it’s not much faster than the single-threaded version. 

  6. This article focuses on my observations of using mmap for reading a file, not writing. As I understand it, writing comes with its own problems, potentially including things like unclear persistence guarantees. But I’m not an expert nor have I investigated. 

  7. The cases are invisible except through side-channels like timing the operation, observing memory usage stats, or page allocation tables. I’m not counting them.