Huon on the internet

10 > 64, in QR codes

By Huon Wilson31 Mar 2024

Encoding data in decimal requires many more characters than the same data encoded in base64—06513249 vs YWJj—but using decimal is better when stored in a QR code. The magic of QR modes means all those extra digits are stored efficiently, almost as though there was no encoding at all. Decimal encoding makes for QR codes that store more data, or are easier to scan.

In this post, we’ll see:

  • how using a decimal encoding (slightly) reduces the density of a QR code containing a URL, in practice;
  • why QR code modes make this work: decimal data is all URL-safe and stores in a QR code Numeric mode efficiently, while base64 ends up with 75% waste because it has to be stored in Binary mode;
  • how this allows shovelling maximum data into a URL in a QR code.
Two QR codes, the left one is version 16 (81 small squares), while the right is slightly less dense, version 14 (73 small squares) base64 decimal

Two QR codes that encode the same data: the left one using base64 is slightly denser than the right using decimal (compare the ‘position’ patterns in the top center). The larger modules (little squares) of the decimal QR makes for easier scanning.

As a case study, in Mechanical sympathy for QR codes: making NSW check-in better, I explored the QR codes used for COVID contact-tracing check-ins here in Australia. It turns out several states plopped a bunch of information in URLs in their QR codes, as a base64-encoded JSON blob, presumably because that’s convenient.

NSW used URLs like the 228-character https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ==, with that eyJ... being the blob. At error correction level H, this can be encoded into a 81×81 QR code (version 16).

If we need to continue storing data in JSON like that (that post describes why we don’t), we can do better than base64: by rewriting the data to decimal, we get the 353-character https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.

Anyone sensible would know this is longer… 50% longer. Fortunately, QR codes transcend sense, so this requires 20% fewer modules (little squares): it fits into a 73×73 QR code (version 14).

QR codes with fewer modules are easier to scan.

Complicated data in URLs

QR codes can store arbitrary data, but often, they’re used to store a URL, so that anyone can get something useful when they scan by loading a website. At the same time, a QR code might be placed somewhere without internet access, and thus also want to include enough information to allow someone with an appropriate app to interact with it usefully while offline. This can make for URLs with a big blob of data in them.

However, a URL is a constrained container: trying to shove arbitrary data into a URL can hit problems with special characters being misinterpreted or mangled. Fortunately, URLs are mostly just text, and there’s lots of ways to convert arbitrary data into text: the Binary-to-text encoding wikipedia page describes 28, at the time of writing.

Base64 is arguably one of the most common: it is built into every web browser, like the one you’re using to read this. It encodes 3 bytes into 4 selected from a reduced 64-character alphabet. It is a sensible default choice. It can be included literally in JSON, HTML/XML attributes, CSS and, of course, URLs.

All of this means base64 data can end up in QR codes, where a URL containing a blob of data is encoded. Unfortunately, base64 is a poor method for encoding binary data in a QR code: the choice of alphabet forces the QR code to store data in an needlessly inefficient way.

Encoding schemes

There’s many, many ways to encode a set of arbitrary bytes into a reduced character set. The ones we’ll consider in detail are base64, base10, and base45 from RFC 9285, which is explicitly designed for QR codes (others like base16 (hex), base32 or base36 are strictly worse).

encoding characters input:output ratio example URL safe?
none all bytes 1 abc no
base64 0–9, A–Z, a–z, +, / 1.33 YWJj partial
base64url 0–9, A–Z, a–z, -, _ 1.33 YWJj yes
base45 0–9, A–Z, $%*+-.:/, space 1.5 0EC92 no
base10 0–9 2.41 06513249 yes

Some of these are “URL-safe”: encoded data can be splatted directly into a URL without requiring any special escaping or handling, while others are not or might require special handling from a server. We’re talking about putting data into URLs, so URL-safety is a requirement, and thus the choices to consider are:

  • the base64url URL-safe variant of base64, and
  • base10 (decimal).

I’ve invented “base10” by treating the bytes as a huge (little-endian) base-256 integer, and then printing the integer in a different base. This doesn’t scale for large data, but works fine for the couple of kilobytes that can be stored in a QR code. A Python version might be:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import math

_DIGITS_PER_BYTE = math.log(2**8, 10)

def b10encode(data: bytes) -> str:
    raw = str(int.from_bytes(data, byteorder="little"))
    # Handle leading zeros so that, for instance, b"", b"\x00"
    # and b"\x00\x00" each have their own unique encoding, not
    # just "0".
    encoded_length = math.ceil(len(data) * _DIGITS_PER_BYTE)
    prefix = "0" * (encoded_length - len(raw))
    return prefix + raw

def b10decode(s: str) -> bytes:
    # Deduce the length of the result from the input, matching
    # b64encode's zero-padding (NB. a real implementation
    # should validate the length is valid)
    decoded_length = math.floor(len(s) / _DIGITS_PER_BYTE)
    return int(s).to_bytes(
        length=decoded_length,
        byteorder="little",
    )

The “input:output ratio” column shows number of output characters required above the number of input characters, on average0.

For example, for encoding in base10, a long input hits the average almost exactly, while short inputs can vary more:

input (hexadecimal) output ratio
01 001 3/1 = 3
12 34 56 05649426 8/3 = 2.67
FF FE … 01 00 00000496…7615 (617 digits) 617/256 = 2.41

QR modes

A QR code stored data in a bitstream, with input data encoded in segments. Each segment of data can be encoded in one of four different modes. Each mode supports different input alphabets, like the “Alphanumeric” mode which only supports storing numbers, upper-case letters and some punctuation, defining a way to map those characters to bits for storage.

mode characters bits per char overhead
Numeric 10: 0–9 3.33 0.34%
Alphanumeric 45: 0–9, A–Z, $%*+-.:/ space 5.5 0.15%
Binary 256: arbitrary bytes 8 0%
Kanji 81891 13 0.0041%

Multiple input characters are stored together for the numeric and alphanumeric modes, which leads to the fractional bits for a single character. For instance, the numeric mode stores groups of 3 digits into 10 bits, like 123456 is encoded in two chunks, 123 then 456, in 20 bits.

The overhead captures how much “wastage” there is. Not much, in any encoding! It’s convenient that 103 is only slightly less than 210, and 452 is less than 211.

When a QR code contains multiple segements of data, it still represents one long stream of characters, so clever choice of splits allows picking the best mode for each significant substring of the input.

Decimal, à la mode

The mode required for storing encoded data depends on the output character set. For the encodings we looked at above2:

encoding characters QR mode
base64url A–Z, a–z, 0–9, -, _ Binary
base10 0–9 Numeric

Base64url contains lower-case letters its output and thus requires the Binary mode when stored in a QR code. 3 input bytes (24 bits of input) turn into 4 output characters, and those 4 characters have to be stored in 4×8 = 32 bits in the QR code. This makes for an average overhead of 33%: 1 input byte is stored as 1.33 bytes in the QR code. Each byte could store up to 256 different values, but the base64 encoding only uses 64 of them. That’s wasting 75% of the values, or 2 bits in every stored byte.

For base10, the calculation isn’t as simple, but we can work through it:

  • The input will be some number of 8-bit bytes, each with 28 = 256 possible values.
  • Each input byte turns into log(256, 10) ≈ 2.408 output digits, on average.
  • 3 digits are stored in 10 bits.
  • Putting this together, each input byte requires 10 / 3 * log(256, 10) ≈ 8.027 bits to store.

This works out to 1 input byte being stored as 1.0034 bytes, on average: an overhead of 0.34%.

This is exactly the overhead of the Numeric mode itself, and there’s no overhead for binary-to-text base10 encoding step! Base10 requires 242% more characters to encode the same data, but those characters can be stored into a QR code efficiently. There’s no waste when stored.

Putting it all together

Let’s go back to our URL from before: https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ==

The data= parameter contains some base64-encoded JSON. I explored in an earlier earlier post how this can be encoded better without JSON or base64… but let’s assume that it needs to remain. We can take the eyJ0I... blob and decode it to the underlying JSON: {"t":"covid19_business","bid":"121321","bname":"Test NSW Government QR code","baddress":"Business address goes here "}.

Running these bytes through the b10encode function above gives a rather long number: 072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.

The decimal encoding looks long to a human, but it is simpler for the QR code encoding, only consuming barely more space than the raw JSON requires[^base16-base32-base36]:

string length
raw JSON 118 bytes
base64 encoding 160 characters
base64 QR storage 160 bytes
base10 encoding 285 digits
base10 QR storage 119 bytes

After publishing this article, people asked about other encodings like base16 (hex), base32 and base36 (which fit into the Alphanumeric mode), or base 8 (fitting into Numeric mode but being easier to work with than base 10). These are less efficient3 than base10 using Numeric mode, but are potentially more convenient in some cases.

In a URL, the rest of the URL is not purely numeric, so actually seeing the benefits of this encoding requires using two segments:

  • one with the “boring” bits of the URL at the start, likely using the Binary mode
  • one with the big blob of base10 data, using the Numeric mode

Using Segno 1.6.1, this is possible to guarantee4 like:

1
2
3
4
5
url_prefix = "https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data="
encoded_data = b10encode(data)

# Two segments, so that they're encoded with separate modes
qr = segno.make([url_prefix, encoded_data])

The NSW COVID check-in QR codes used error correction level H. Using this, and just swapping the encoding of the data parameter, we see the result from the start of the post:

Two QR codes, same as the ones at the start of the blog post base64 base10

Two QR codes that encode the same data: the left uses base64, giving a 81×81 QR code (version 16), while the right uses base10, giving a 73×73 QR code (version 14).

Extremes

Being clever about modes, encodings and segments allows one to meet the limits of the QR code format. QR codes can store up to 23.6 kilobits ≈ 3.0 KB, by using version 40 with error correction level L. This translates into approximately 7k decimal digits in the Numeric mode, or 3k bytes in the Binary mode.

We can try this out with some test URLs like http://example.com/{encoded_data}. The maximum we can fit into a QR code looks like:

encoding input data URL length
base64url 2.2 KB 2.9k
base10 2.9 KB 7.0k

The much longer URL of base10 theoretically doesn’t matter: the QR mode means it squishes in very efficiently.

However, after publishing this article, it was pointed out that iOS doesn’t correctly scan the huge base10 URL, resolving to just http://example.com instead of http://example.com/310.... So URL length does matter! It seems like this is a different length limit to normal URL restrictions, since 8KB is a reasonable lower limit for “normal” browsers these days, so 7k characters should be safe. Pasting the same link into Safari directly works fine!

Thus, if you’re (ab)using this trick, caveat QRtor: do your own testing.

Two QR codes, both version 40 (177 squares on each side). The one on the left uses base64, the one on the right . base64 decimal

Two maxed-out QR codes that both only slightly different to most humans, but you and I know the second one contains 33% more data.

Summary

Paying attention to the details of the QR code format allow being more efficient. The trick in this post is using the Decimal mode to store arbitrary binary data in URLs, with minimal overhead. This works because QR code modes allow packing particular character sets into the underlying QR bitstream efficiently.

This allows reducing the density of a QR code to make it easier to scan, or jamming in more data.

  1. The average number of output characters required for each input is log(size of input alphabet) / log(size of output alphabet). For instance, for encoding bytes to decimal, log(256) / log(10) = log10(256) = 2.4082…

  2. The 8189 count was computed by this Python one-liner: len({(diff >> 8) * 0xc0 + (diff & 0xff) for diff in [*(code - 0x8140 for code in range(0x8140, 0x9ffc+1)), *(code - 0xc140 for code in range(0xe040, 0xebbf+1))]}). It naively computes the number of unique outputs of Segno’s Kanji encoder, assuming that all inputs from 0x8140 to 09FFC and 0xE040 to 0xEBBF are valid. 

  3. This only uses two of the modes: what about Alphanumeric and Kanji? According to encodeURIComponent the largest subset of Alphanumeric that’s URL safe is only 39 characters (0–9, A–Z, *-.), and losing 5 characters introduces an additional 3.9% overhead, similar to how going from 256 characters to 64 for base64 has 33% overhead. It seems like Kanji are similarly not URL safe, and thus violate that requirement too. 

  4. This table shows the values for all those options, including the results on the example JSON data, sorted by overhead / storage size:

    base mode overhead example encoding example QR storage (bytes) example QR version
    10 Numeric 0.35% 285 digits 119 14
    39 Alphanumeric 3.9%
    log(45) / log(39) - 1
    179 chars 124 14
    36 Alphanumeric 6.2%
    log(45) / log(36) - 1
    183 chars 126 15
    32 Alphanumeric 10%
    5 bits into 1 char (5.5 bits)
    189 chars 130 15
    8
    bits
    Numeric 11%
    3 bits into 1 digit (3.33 bits)
    315 digits 132 15
    8
    bytes
    Numeric 25%
    8 bits into 3 digits (10 bits)
    354 digits 148 15
    64 Binary 33% 160 chars 160 16
    16 Alphanumeric 37.5%
    1 byte (8 bits) into 2 chars (11 bits)
    236 digits 163 16

    There’s two variants of base 8: encoding each byte as its own 3 digit base-8 number (bytes), or ignoring byte boundaries and encoding 3 bits at a time (bits) which is similar to how base 10 and base 36 work. This was suggested by a reader.

    Base 39 is maximal URL-safe subset of the Alphanumeric character set2

  5. In the QR code standard (ISO/IEC 18004:2015(E)), Annex J “Optimisation of bit stream length” suggests a method to automatically split up an input stream to optimise the modes, so a sufficiently smart library could just take a single string and arrange the segments itself.