RFC: Migrate the session ID to ULID / UUIDv7
The session ID is currently created by generating 20 random bytes using a CSPRNG and encoded as Base16. This creates unpredictable IDs that make it impossible to guess them. However, this has the significant downside that they make for a poor primary key since insertions happen at arbitrary locations in the tree which yields a bad insert performance with InnoDB’s primary key handling.
ULID
We can solve this by migrating to ULID which offers significant benefits for our use case:
- It uses 48 bits for the timestamp with millisecond precisions while still allowing for 1.21e+24 ULIDs for any given millisecond.
- 80 bits of randomness are still way enough to make them unpredictable even when the generation millisecond is known.
- They are lexicographically sortable which plays very nice with InnoDB’s primary key.
- Encoded in Crockford Base32 this yields 26 characters which is even shorter than our current session ID.
There are two PHP libraries that look promising, https://github.com/robinvdvleuten/php-ulid (453 stars, but not updated since 2020) and https://github.com/ramsey/identifier (57 stars, v0.1.0 on June 19, but requires PHP 8.2+, known PHP developer).
UUID v7
The newer spec for UUID v7 solves the same issues that ULID:
- The first 48 bits are used to to encode the generation UNIX timestamp in milliseconds.
- 74 bits of randomness which still offers plenty of randomness to make it unpredictable.
- Compatible with other UUIDs because it uses the same format (debatable wether this is relevant to us).
- Official standard as outlined in RFC 9562.
We can utilize https://github.com/ramsey/uuid which has 12.5k stars and is also maintained by Ben Ramsey. There is also https://github.com/symfony/uid which supports UUID v7 but has only 586 stars.
UUID v7 is probably the better pick due to it being an official standard at this point.
Migration Path
The migration path can be a bit difficult because of the foreign key constraints. Thankfully the cookie format supports versioning which enables us to introduce a v2 with a different format.
After some internal discussion, we came to the conclusion that it would be easier and more efficient to use a custom format that also fits into the existing CHAR(40) column to make the migration easier.
In general we want at least 128 bits of randomness as this is the general consensus for something that is secure and not guessable in practice. In addition, the cookie is cryptographically signed which adds another layer of defense on top.
Moving away from a pure Base16 encoding allows us to use the space more efficiently.
Concept for a Custom Format
<timestamp><milliseconds><randomData>V<version>
- The timestamp is encoded as u32 and encoded as Base16.
- Milliseconds are encoded as a time step
\intdiv(\substr(\explode(' ', microtime(false))[0], 2, 6), 3_921)yielding values between 0-255, which can be interpreted as u8 and encoded as Base16. - The string ends with
Vxxwherexxis the version number as u8 encoded as Base16. - The version marker allows us to detect older sessionIDs that can then be seamlessly migrated to the new format. The
Vserves as a marker because the previous ID was Base16 encoded and adds a path format for new migrations.
This requires a total of 13 characters, leaving 27 characters for the random data. Using Base64 we can generate 20 bytes of random data, giving us 160 bits of randomness. The resulting string is 28 characters long and ends with a padding = which we can cut off since we don’t care about decoding it anyway.
Base62 for Random Data
Unfortunately, Base64 yields + and / which are annoying to deal with when trying to copy it with a double click. Since we’re way above 128 bits already, we can give up some and instead use Base62 encoding which almost the same as Base64 but does not use + and /.
Using https://gist.github.com/Synchro/1139429 as a reference, this allows us to encode 19 bytes using 26 characters yielding 152 bits of entropy. We can encode 20 bytes (yields 28 characters) which can be truncated at the right by one character, leaving us with effectively 156 bits of entropy.
Using Base16 for Numbers
We want to keep the timestamp and time step for the milliseconds encoded in Base16 because it plays nicely with MySQL’s primary key behavior since the values follow an order, greatly reducing the need to move rows around.
Encoding of the Version Number
The above proposal uses 2 characters to encode the version number using an u8. Since it is unlikely we will ever hit that limit we could truncate it to 1 character, effectively reducing it to a u4 value. Since 0-1 are considered to be burned and 2 being the first version to use this marker, this leaves us with 13 possible follow-up versions.
This has the benefit of increasing the number of characters available for the random to an even 28 characters, bumping us back up to 160 bits of randomness when using Base62.
Crockford’s Base32
An alternative is Crockford’s Base32 which avoids characters that can be confused and is only slightly less efficient. An example implementation can be found at https://gist.github.com/joelharkes/82ac4eb0fd8226649a6118562e3817c4
28 characters allow us to encode 17 random bytes which gives us 136 bit of randomness. Compared to Base62 this is a lot less of random data, but is much faster and avoids using GMP which is a requirement for Base62 in PHP.
This is possibly the sweet spot that satisfies the 128 bit threshold for randomness while being simple to implement.
I was curious and tested the performance impact of inserting 10k sessions. All tests were performed using PHP 8.3 on an M1 Pro using MySQL 9.3.
Baseline
In total this took around ~3s with no system load. Generating the IDs using \random_bytes(20) and constant time encoding yields accounts for ~72ms.
The numbers were taking after filling the session table with 100k randomly generated IDs prior to running the tests.
New Format
My test implementation uses this monstrosity to generate the session ID using the V2 format using Base32:
$sessionID = \sprintf(
'%s%sV%s',
Hex::encode(\pack(
'NC',
\TIME_NOW,
\intdiv(\substr(\explode(' ', microtime(false))[0], 2, 6), 3_921)
)),
\substr(Base32::encode(\random_bytes(17)), 0, 28),
\substr(\bin2hex(\pack('C', 2)), 1),
);
Some examples:
687260e3c6lygg5g7fl7gafj2a2vnwmatncbsqV2
687260e8cfec2wxvtkbsan3zhj6473dni54zzaV2
687260edf1nhpcdgujmdh6br6dx5a75kdp7osaV2
687260f248qr6dtivuywzponc6uldo62niexbaV2
687260f6c6md74wbpyrioxigyh562cemkru3zqV2
Performance of Generating IDs
Generating 10k such IDs takes about ~102ms which is an almost 41% increase from the previous session ID. That sounds a lot but is an acceptable increase given the increased complexity. I have not been able to utilize %x with for the version number since the width parameter only works from the left. Considering this is the execution time for 102ms, it’s not worth it to spend any more time optimizing it.
Performance of INSERT
These numbers were taking after filling the session table with 100k randomly generated IDs prior to running the tests. The insert performance improved a bit with 2.7s on average. That is a 10% improvement but much less than anticipated.
Using BINARY(40) for the sessionID
This reduces the insert performance to ~2.2s on average which is a massive improvement over both prior implementations. I compared this to the existing implementation and saw only a minor but consistent improvement with insert times hovering around 2.9s.
phpMyAdmin natively attempts to display BINARY as ASCII which works well with our generated IDs. Adminer requires an extra CONVERT(sessionID USING ascii) to display the values because it defaults to Base16 for binary columns.
I did some further testing to see how much improvement is possible there and changed the sessionID to be a simple i32 without an auto increment, with the values being 1 to 10,000. The insert performance was around ~2.1s which means that achieving anything close to that number is already the best case.
Performance Comparison
I set up a test for 100,000 sequential inserts operations in a simple for loop which is intended to reduce run-to-run variation and gives us better numbers. There may be still some variation but at the end of the day the relative performance matters.
The table was truncated before each run.
| Method | Column Definition | Runtime | Change |
|---|---|---|---|
| 20 Bytes, Base16 | CHAR(40) |
25.79s | n/a |
| 20 Bytes (raw) | BINARY(20) |
18.80s | -27% |
| Proposed sessionID | CHAR(40) |
18.45s | -28% |
| Ascending integer | INT |
17.79s | -31% |
The “Ascending integer” is the best case scenario and as we can see, the proposed change comes quite close to it. There is another huge improvement when switching the existing 20 bytes from Base16 to just store the raw bytes using BINARY(20).
There is still a slight performance improvement of using the new format which could indicate that the ordered nature of the IDs provide some benefits here. In addition, the new format remains human readable and does not require a different column format by utilizing the existing CHAR(40). At a 4% difference this is not worth any further optimizations.
Digging a bit deeper, I figured out that the primary key in InnoDB is not a B-tree but a B+-tree which scales with the size of the key and the order. This does explain the performance improvement of using BINARY(20).