10 > 64, in QR codes
Encoding data in decimal requires many more characters than the same data encoded in base64 (06513249 vs YWJj), but using decimal is better when stored in a QR code. The magic of QR modes means all those extra digits are stored efficiently, almost as though there were no encoding at all. Decimal encoding makes for QR codes that store more data, or are easier to scan.
In this post, we’ll see:
- how using a decimal encoding (slightly) reduces the density of a QR code containing a URL, in practice;
- why QR code modes make this work: decimal data is all URL-safe and stores in a QR code Numeric mode efficiently, while base64 ends up with 75% waste because it has to be stored in Binary mode;
- how this allows shovelling maximum data into a URL in a QR code.
As a case study, in Mechanical sympathy for QR codes: making NSW check-in better, I explored the QR codes used for COVID contact-tracing check-ins here in Australia. It turns out several states plopped a bunch of information in URLs in their QR codes, as a base64-encoded JSON blob, presumably because that’s convenient.
NSW used URLs like the 228-character https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ==, with that eyJ... being the blob. At error correction level H, this can be encoded into an 81×81 QR code (version 16).
If we need to continue storing data in JSON like that (that post describes why we don’t), we can do better than base64: by rewriting the data to decimal, we get the 353-character https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.
Anyone sensible would know this is longer… 50% longer. Fortunately, QR codes transcend sense, so this requires 20% fewer modules (little squares): it fits into a 73×73 QR code (version 14).
QR codes with fewer modules are easier to scan.
Complicated data in URLs
QR codes can store arbitrary data, but often, they’re used to store a URL, so that anyone can get something useful when they scan by loading a website. At the same time, a QR code might be placed somewhere without internet access, and thus also want to include enough information to allow someone with an appropriate app to interact with it usefully while offline. This can make for URLs with a big blob of data in them.
However, a URL is a constrained container: trying to shove arbitrary data into a URL can hit problems with special characters being misinterpreted or mangled. Fortunately, URLs are mostly just text, and there’s lots of ways to convert arbitrary data into text: the Binary-to-text encoding wikipedia page describes 28, at the time of writing.
Base64 is arguably one of the most common: it is built into every web browser, like the one you’re using to read this. It encodes 3 bytes into 4 characters selected from a reduced 64-character alphabet. It is a sensible default choice. It can be included literally in JSON, HTML/XML attributes, CSS and, of course, URLs.
All of this means base64 data can end up in QR codes, where a URL containing a blob of data is encoded. Unfortunately, base64 is a poor method for encoding binary data in a QR code: the choice of alphabet forces the QR code to store data in a needlessly inefficient way.
Encoding schemes
There’s many, many ways to encode a set of arbitrary bytes into a reduced character set. The ones we’ll consider in detail are base64, base10, and base45 from RFC 9285, which is explicitly designed for QR codes (others like base16 (hex), base32 or base36 are strictly worse).
encoding | characters | input:output ratio | example | URL safe? |
---|---|---|---|---|
none | all bytes | 1 | abc | no |
base64 | 0–9, A–Z, a–z, +, / | 1.33 | YWJj | partial |
base64url | 0–9, A–Z, a–z, -, _ | 1.33 | YWJj | yes |
base45 | 0–9, A–Z, $%*+-.:/, space | 1.5 | 0EC92 | no |
base10 | 0–9 | 2.41 | 06513249 | yes |
Some of these are “URL-safe”: encoded data can be splatted directly into a URL without requiring any special escaping or handling, while others are not or might require special handling from a server. We’re talking about putting data into URLs, so URL-safety is a requirement, and thus the choices to consider are:
- the base64url URL-safe variant of base64, and
- base10 (decimal).
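To make the difference between base64 and base64url concrete, here is a small sketch using Python's standard `base64` module. The two input bytes are an arbitrary choice (not from the NSW data), picked so that the output lands on alphabet positions 62 and 63, the only two places where the variants differ.

```python
import base64

# Arbitrary two-byte input whose encoding uses alphabet indices 62 and 63,
# which are "+" and "/" in base64 but "-" and "_" in base64url.
data = b"\xfb\xff"

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")

print(standard)  # +/8=
print(url_safe)  # -_8=
```

Everything else about the two encodings is identical, including the 4-characters-per-3-bytes ratio.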
I’ve invented “base10” by treating the bytes as a huge (little-endian) base-256 integer, and then printing the integer in a different base. This doesn’t scale for large data, but works fine for the couple of kilobytes that can be stored in a QR code. A Python version might be:
import math

_DIGITS_PER_BYTE = math.log(2**8, 10)


def b10encode(data: bytes) -> str:
    raw = str(int.from_bytes(data, byteorder="little"))
    # Handle leading zeros so that, for instance, b"", b"\x00"
    # and b"\x00\x00" each have their own unique encoding, not
    # just "0".
    encoded_length = math.ceil(len(data) * _DIGITS_PER_BYTE)
    prefix = "0" * (encoded_length - len(raw))
    return prefix + raw


def b10decode(s: str) -> bytes:
    # Deduce the length of the result from the input, matching
    # b10encode's zero-padding (NB. a real implementation
    # should validate the length is valid)
    decoded_length = math.floor(len(s) / _DIGITS_PER_BYTE)
    return int(s).to_bytes(
        length=decoded_length,
        byteorder="little",
    )
The “input:output ratio” column shows the number of output characters required per input character (byte), on average [0].
For example, for encoding in base10, a long input hits the average almost exactly, while short inputs can vary more:
input (hexadecimal) | output | ratio |
---|---|---|
01 | 001 | 3/1 = 3 |
12 34 56 | 05649426 | 8/3 = 2.67 |
FF FE … 01 00 | 00000496…7615 (617 digits) | 617/256 = 2.41 |
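The rows above can be reproduced with a short script. This is a sketch that carries a condensed copy of the `b10encode`/`b10decode` functions from earlier (using `str.zfill` for the padding, which is my shortcut rather than the original code), and it also checks that each sample survives a round trip.

```python
import math

DIGITS_PER_BYTE = math.log(2**8, 10)  # about 2.408


def b10encode(data: bytes) -> str:
    # Little-endian base-256 integer, printed in decimal and zero-padded
    raw = str(int.from_bytes(data, byteorder="little"))
    return raw.zfill(math.ceil(len(data) * DIGITS_PER_BYTE))


def b10decode(s: str) -> bytes:
    # The byte count is deduced from the number of digits
    decoded_length = math.floor(len(s) / DIGITS_PER_BYTE)
    return int(s).to_bytes(decoded_length, byteorder="little")


samples = [
    bytes([0x01]),
    bytes([0x12, 0x34, 0x56]),
    bytes(range(255, -1, -1)),  # FF FE ... 01 00
]
for data in samples:
    encoded = b10encode(data)
    assert b10decode(encoded) == data
    # input bytes, output digits, ratio
    print(len(data), len(encoded), round(len(encoded) / len(data), 2))
```

The ratio only settles near 2.41 once the input is long enough for the rounding up to a whole digit to stop mattering.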
QR modes
A QR code stores data in a bitstream, with input data encoded in segments. Each segment of data can be encoded in one of four different modes. Each mode supports a different input alphabet and defines a way to map those characters to bits for storage: the “Alphanumeric” mode, for instance, only supports storing numbers, upper-case letters and some punctuation.
mode | characters | bits per char | overhead |
---|---|---|---|
Numeric | 10: 0–9 | 3.33 | 0.34% |
Alphanumeric | 45: 0–9, A–Z, $%*+-.:/ space | 5.5 | 0.15% |
Binary | 256: arbitrary bytes | 8 | 0% |
Kanji | 8189 [1] | 13 | 0.0041% |
Multiple input characters are stored together for the numeric and alphanumeric modes, which leads to the fractional bits for a single character. For instance, the numeric mode stores groups of 3 digits into 10 bits: 123456 is encoded in two chunks, 123 then 456, in 20 bits.
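As a sketch of that packing, the function below turns a string of digits into the bits Numeric mode would store for it. The function name is mine, and it only produces the data bits: the segment header (mode indicator and character count) is left out. The handling of leftover digits (7 bits for a trailing pair, 4 bits for a single digit) follows the QR code standard.

```python
def numeric_mode_bits(digits: str) -> str:
    """Pack decimal digits as Numeric mode data bits (header omitted)."""
    if not all(c in "0123456789" for c in digits):
        raise ValueError("Numeric mode only stores the digits 0-9")
    # Bit width for a group of 3, 2 or 1 digits
    widths = {3: 10, 2: 7, 1: 4}
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "".join(
        format(int(chunk), f"0{widths[len(chunk)]}b") for chunk in chunks
    )


bits = numeric_mode_bits("123456")
print(bits)       # 00011110110111001000
print(len(bits))  # 20
```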
The overhead captures how much “wastage” there is. Not much, in any encoding! It’s convenient that 10³ is only slightly less than 2¹⁰, and 45² is less than 2¹¹.
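The overhead column can be recomputed from the other two columns: it compares the bits a mode actually spends per character against the information-theoretic minimum, log2 of the alphabet size. The sketch below assumes that definition of overhead, and takes the alphabet sizes and bit widths straight from the table.

```python
import math

# (mode, alphabet size, bits stored per character)
MODES = [
    ("Numeric", 10, 10 / 3),       # 3 digits in 10 bits
    ("Alphanumeric", 45, 11 / 2),  # 2 characters in 11 bits
    ("Binary", 256, 8),
    ("Kanji", 8189, 13),
]


def overhead(alphabet_size: int, bits_per_char: float) -> float:
    # Fraction of extra bits spent beyond the theoretical minimum
    ideal_bits = math.log2(alphabet_size)
    return bits_per_char / ideal_bits - 1


for name, size, bits in MODES:
    print(f"{name:<12} {overhead(size, bits):.4%}")
```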
When a QR code contains multiple segments of data, it still represents one long stream of characters, so clever choice of splits allows picking the best mode for each significant substring of the input.
Decimal, à la mode
The mode required for storing encoded data depends on the output character set. For the encodings we looked at above [2]:
encoding | characters | QR mode |
---|---|---|
base64url | A–Z, a–z, 0–9, -, _ | Binary |
base10 | 0–9 | Numeric |
Base64url contains lower-case letters in its output and thus requires the Binary mode when stored in a QR code. 3 input bytes (24 bits of input) turn into 4 output characters, and those 4 characters have to be stored in 4×8 = 32 bits in the QR code. This makes for an average overhead of 33%: 1 input byte is stored as 1.33 bytes in the QR code. Each byte could store up to 256 different values, but the base64 encoding only uses 64 of them. That’s wasting 75% of the values, or 2 bits in every stored byte.
For base10, the calculation isn’t as simple, but we can work through it:
- The input will be some number of 8-bit bytes, each with 2⁸ = 256 possible values.
- Each input byte turns into log(256, 10) ≈ 2.408 output digits, on average.
- 3 digits are stored in 10 bits.
- Putting this together, each input byte requires 10 / 3 * log(256, 10) ≈ 8.027 bits to store.
This works out to 1 input byte being stored as 1.0034 bytes, on average: an overhead of 0.34%.
This is exactly the overhead of the Numeric mode itself: there’s no overhead from the binary-to-text base10 encoding step! Base10 requires about 2.4× as many characters to encode the same data (141% more), but those characters can be stored into a QR code efficiently. There’s no waste when stored.
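The two calculations can be put side by side in a few lines. This is a sketch with a helper name of my own choosing: it multiplies the characters each encoding emits per input byte by the bits the matching QR mode spends per character.

```python
import math


def stored_bits_per_input_byte(alphabet_size: int, mode_bits_per_char: float) -> float:
    # Characters emitted per input byte, times bits stored per character
    chars_per_byte = math.log(256, alphabet_size)
    return chars_per_byte * mode_bits_per_char


base64url = stored_bits_per_input_byte(64, 8)    # stored in Binary mode
base10 = stored_bits_per_input_byte(10, 10 / 3)  # stored in Numeric mode

for name, bits in [("base64url", base64url), ("base10", base10)]:
    print(f"{name}: {bits:.3f} bits per byte, overhead {bits / 8 - 1:.2%}")
```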
Putting it all together
Let’s go back to our URL from before: https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ==
The data= parameter contains some base64-encoded JSON. I explored in an earlier post how this can be encoded better without JSON or base64… but let’s assume that it needs to remain. We can take the eyJ0I... blob and decode it to the underlying JSON: {"t":"covid19_business","bid":"121321","bname":"Test NSW Government QR code","baddress":"Business address goes here "}.
Running these bytes through the b10encode function above gives a rather long number: 072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.
The decimal encoding looks long to a human, but it is cheaper for the QR code to store, consuming barely more space than the raw JSON requires:
string | length |
---|---|
raw JSON | 118 bytes |
base64 encoding | 160 characters |
base64 QR storage | 160 bytes |
base10 encoding | 285 digits |
base10 QR storage | 119 bytes |
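The numbers in this table can be rederived from the blob itself. The sketch below decodes the base64 from the URL, computes how many decimal digits the base10 encoding needs, and converts both encodings into QR storage sizes. It counts data bits only, ignoring each segment's header, and the variable names are mine.

```python
import base64
import math

blob = "eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ=="
raw = base64.b64decode(blob)

# Length of the base10 encoding of the raw bytes
digits = math.ceil(len(raw) * math.log(256, 10))

# Binary mode: one byte per character
base64_storage_bytes = len(blob)
# Numeric mode: 10 bits per 3 digits, 7 for a trailing pair, 4 for one digit
numeric_bits = (digits // 3) * 10 + (0, 4, 7)[digits % 3]
base10_storage_bytes = math.ceil(numeric_bits / 8)

print(len(raw), len(blob), base64_storage_bytes, digits, base10_storage_bytes)
```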
After publishing this article, people asked about other encodings like base16 (hex), base32 and base36 (which fit into the Alphanumeric mode), or base 8 (fitting into Numeric mode but being easier to work with than base 10). These are less efficient [3] than base10 using Numeric mode, but are potentially more convenient in some cases.
In a URL, the rest of the URL is not purely numeric, so actually seeing the benefits of this encoding requires using two segments:
- one with the “boring” bits of the URL at the start, likely using the Binary mode
- one with the big blob of base10 data, using the Numeric mode
Using Segno 1.6.1, this can be guaranteed [4] like so:
import segno

url_prefix = "https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data="
# data: the raw bytes to embed, such as the JSON above
encoded_data = b10encode(data)
# Two segments, so that they're encoded with separate modes
qr = segno.make([url_prefix, encoded_data])
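To see why splitting into two segments pays off, here is a rough bit budget for both versions of the URL. It is a sketch under stated assumptions rather than what Segno computes internally: each segment is taken to cost a 4-bit mode indicator plus a character count field, which the QR standard sizes at 16 bits (Binary) and 12 bits (Numeric) for versions 10 to 26, the range both example codes fall in. The function names are mine, and the digit and character counts come from the table above.

```python
def byte_segment_bits(n_bytes: int) -> int:
    # Mode indicator + 16-bit count (versions 10-26) + 8 bits per byte
    return 4 + 16 + 8 * n_bytes


def numeric_segment_bits(n_digits: int) -> int:
    # Mode indicator + 12-bit count (versions 10-26) + packed digits
    return 4 + 12 + (n_digits // 3) * 10 + (0, 4, 7)[n_digits % 3]


url_prefix = "https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data="

# Original: prefix and 160 base64 characters in a single Binary segment
single = byte_segment_bits(len(url_prefix) + 160)
# Decimal: prefix in a Binary segment, 285 digits in a Numeric segment
split = byte_segment_bits(len(url_prefix)) + numeric_segment_bits(285)

print(single, split)  # 1844 1530
```

The second segment's header costs 16 extra bits, which is tiny next to the saving from storing the blob in Numeric mode.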
The NSW COVID check-in QR codes used error correction level H. Using this, and just swapping the encoding of the data parameter, we see the result from the start of the post: the QR code shrinks from 81×81 (version 16) to 73×73 (version 14).
Extremes
Being clever about modes, encodings and segments allows one to meet the limits of the QR code format. QR codes can store up to 23.6 kilobits ≈ 3.0 KB, by using version 40 with error correction level L. This translates into approximately 7k decimal digits in the Numeric mode, or 3k bytes in the Binary mode.
We can try this out with some test URLs like http://example.com/{encoded_data}. The maximum we can fit into a QR code looks like:
encoding | input data | URL length |
---|---|---|
base64url | 2.2 KB | 2.9k |
base10 | 2.9 KB | 7.0k |
The much longer URL of base10 theoretically doesn’t matter: the QR mode means it squishes in very efficiently.
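A back-of-the-envelope estimate lands close to the figures in the table. The sketch below is not how the table was produced: it assumes version 40-L offers 2956 data codewords, uses the character count field sizes the standard gives for versions 27 to 40 (16 bits for Binary, 14 for Numeric), assumes unpadded base64url, and ignores any terminator or padding bits. All names are mine.

```python
import math

CAPACITY_BITS = 2956 * 8  # assumed: data codewords in a version 40-L code
MODE_BITS = 4             # mode indicator at the start of every segment
COUNT_BITS = {"numeric": 14, "binary": 16}  # assumed: versions 27-40
PREFIX = "http://example.com/"

# base64url: the whole URL sits in a single Binary segment
url_chars = (CAPACITY_BITS - MODE_BITS - COUNT_BITS["binary"]) // 8
b64_chars = url_chars - len(PREFIX)
b64_input_bytes = b64_chars * 3 // 4  # 4 characters carry 3 bytes

# base10: a Binary segment for the prefix, then a Numeric segment
prefix_bits = MODE_BITS + COUNT_BITS["binary"] + 8 * len(PREFIX)
left = CAPACITY_BITS - prefix_bits - MODE_BITS - COUNT_BITS["numeric"]
groups, spare = divmod(left, 10)
b10_digits = groups * 3 + (2 if spare >= 7 else 1 if spare >= 4 else 0)
b10_input_bytes = math.floor(b10_digits / math.log(256, 10))

print("base64url:", b64_input_bytes, "bytes in", len(PREFIX) + b64_chars, "chars")
print("base10:   ", b10_input_bytes, "bytes in", len(PREFIX) + b10_digits, "chars")
```

Under these assumptions the estimate gives roughly 2.2 KB of input for base64url and 2.9 KB for base10, in line with the table.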
However, after publishing this article, it was pointed out that iOS doesn’t correctly scan the huge base10 URL, resolving to just http://example.com instead of http://example.com/310.... So URL length does matter! It seems like this is a different length limit to normal URL restrictions, since 8KB is a reasonable lower limit for “normal” browsers these days, so 7k characters should be safe. Pasting the same link into Safari directly works fine!
Thus, if you’re (ab)using this trick, caveat QRtor: do your own testing.
Summary
Paying attention to the details of the QR code format allows being more efficient. The trick in this post is using the Numeric mode to store arbitrary binary data in URLs, encoded as decimal, with minimal overhead. This works because QR code modes allow packing particular character sets into the underlying QR bitstream efficiently.
This allows reducing the density of a QR code to make it easier to scan, or jamming in more data.
[0] The average number of output characters required for each input character is log(size of input alphabet) / log(size of output alphabet). For instance, for encoding bytes to decimal, log(256) / log(10) = log10(256) = 2.4082….

[1] The 8189 count was computed by this Python one-liner: len({(diff >> 8) * 0xc0 + (diff & 0xff) for diff in [*(code - 0x8140 for code in range(0x8140, 0x9ffc+1)), *(code - 0xc140 for code in range(0xe040, 0xebbf+1))]}). It naively computes the number of unique outputs of Segno’s Kanji encoder, assuming that all inputs from 0x8140 to 0x9FFC and 0xE040 to 0xEBBF are valid.

[2] This only uses two of the modes: what about Alphanumeric and Kanji? According to encodeURIComponent, the largest subset of Alphanumeric that’s URL safe is only 39 characters (0–9, A–Z, *-.), and losing 6 characters introduces an additional 3.9% overhead, similar to how going from 256 characters to 64 for base64 has 33% overhead. It seems like Kanji are similarly not URL safe, and thus violate that requirement too.

[3] This table shows the values for all those options, including the results on the example JSON data, sorted by overhead / storage size:

base | mode | overhead | example encoding | example QR storage (bytes) | example QR version |
---|---|---|---|---|---|
10 | Numeric | 0.34% | 285 digits | 119 | 14 |
39 | Alphanumeric | 3.9%: log(45) / log(39) - 1 | 179 chars | 124 | 14 |
36 | Alphanumeric | 6.2%: log(45) / log(36) - 1 | 183 chars | 126 | 15 |
32 | Alphanumeric | 10%: 5 bits into 1 char (5.5 bits) | 189 chars | 130 | 15 |
8 (bits) | Numeric | 11%: 3 bits into 1 digit (3.33 bits) | 315 digits | 132 | 15 |
8 (bytes) | Numeric | 25%: 8 bits into 3 digits (10 bits) | 354 digits | 148 | 15 |
64 | Binary | 33% | 160 chars | 160 | 16 |
16 | Alphanumeric | 37.5%: 1 byte (8 bits) into 2 chars (11 bits) | 236 digits | 163 | 16 |

There’s two variants of base 8: encoding each byte as its own 3 digit base-8 number (bytes), or ignoring byte boundaries and encoding 3 bits at a time (bits), which is similar to how base 10 and base 36 work. This was suggested by a reader.

Base 39 is the maximal URL-safe subset of the Alphanumeric character set [2].

[4] In the QR code standard (ISO/IEC 18004:2015(E)), Annex J “Optimisation of bit stream length” suggests a method to automatically split up an input stream to optimise the modes, so a sufficiently smart library could just take a single string and arrange the segments itself.