[STORAGE USE REDUCTION] Commitlog segment compression
Please keep in mind that closed commitlog segments may need to be accessed by tx offset, at least until the most recent snapshot. We maintain an offset index, mapping tx offsets to byte offsets, for that purpose.
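For illustration, a read by tx offset resolves through that index to the nearest indexed byte offset at or below the target and seeks there. A minimal sketch, where `OffsetIndex` and its sparseness are stand-ins and not the real commitlog API:

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{self, Seek, SeekFrom};

/// Illustrative stand-in for the per-segment offset index: a (possibly
/// sparse) map from tx offsets to byte offsets within the uncompressed segment.
struct OffsetIndex(BTreeMap<u64, u64>);

impl OffsetIndex {
    /// Position `segment` at the last indexed entry at or before `tx_offset`;
    /// the reader then scans forward to the exact transaction.
    fn seek_to_tx(&self, segment: &mut File, tx_offset: u64) -> io::Result<u64> {
        let (&indexed_tx, &byte_offset) = self
            .0
            .range(..=tx_offset)
            .next_back()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "tx offset below indexed range"))?;
        // This seek is in *uncompressed* byte offsets, which is exactly what
        // zstd-seekable preserves and plain whole-file compression would break.
        segment.seek(SeekFrom::Start(byte_offset))?;
        Ok(indexed_tx)
    }
}
```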
So either we compress segments only when moving them to cold storage (i.e. only those from before a snapshot), or the index has to remain functional after compression. The latter can be achieved by compressing segments using zstd-seekable, which appends a seek table that allows seeking to the original (uncompressed) byte offsets. Since zstd also employs a magic byte sequence, we don't even need to change anything about the commitlog format.
It was decided that we can do this later in a non-breaking way by looking for either our magic number or the zstd magic number to determine whether this segment file is compressed or not.
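A rough sketch of that detection, assuming we only need to peek at the first four bytes of a segment file. The zstd frame magic is 0xFD2FB528, written little-endian on disk; `COMMITLOG_MAGIC` here is a placeholder for whatever the real commitlog header uses:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// zstd frame magic number 0xFD2FB528, as it appears on disk (little-endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];
/// Placeholder; the real commitlog magic bytes live in the commitlog crate.
const COMMITLOG_MAGIC: [u8; 4] = *b"CLOG";

enum SegmentEncoding {
    Plain,
    Zstd,
}

fn segment_encoding(path: &Path) -> io::Result<SegmentEncoding> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    if magic == ZSTD_MAGIC {
        Ok(SegmentEncoding::Zstd)
    } else if magic == COMMITLOG_MAGIC {
        Ok(SegmentEncoding::Plain)
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "unrecognized segment header"))
    }
}
```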
MVP / definition of done, as I see it:
- Determine some time at which to compress old commitlog segments. Possibly:
  - Whenever a commitlog segment is filled and a new segment is started, compress the old segment. Don't worry about the snapshots.
  - After taking a snapshot, compress all segments older than the one containing the snapshotted TX. Keep the segment(s) needed to replay from the most recent snapshot uncompressed (see the sketch after this list).
- When replaying from the commitlog, if a segment is compressed, decompress it in-memory. Do not store the uncompressed version to disk.
- Benchmark to ensure that replaying from the most recent snapshot is not catastrophically slower. We do not care if replaying from older snapshots is slow. We can also afford a small regression even on the most recent snapshot.
- Test to ensure that replaying from an older snapshot, or no snapshot at all, is still possible, even if it is slow.
- Test that traversing a compressed segment from an offset that is not the start of the segment is not catastrophically slower (e.g. by having to decompress and traverse from the start of the segment, instead of seeking using the offset index).
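To make the snapshot-triggered compression and the in-memory decompression concrete, here is a minimal sketch. It uses the plain `zstd` crate rather than the seekable format the plan actually calls for, reuses the `segment_encoding` helper sketched above, and leaves out how segments older than the snapshot are enumerated; all names are illustrative, not the real commitlog API:

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// After taking a snapshot, compress every closed segment that is older than
/// the one containing the snapshotted TX. Enumerating those segments is left
/// to the real commitlog code; this only shows the compression step.
fn compress_old_segments(segments_older_than_snapshot: &[PathBuf]) -> io::Result<()> {
    for path in segments_older_than_snapshot {
        let tmp = path.with_extension("zst.tmp");
        // Level 3 is zstd's default; a real implementation would tune this and
        // would use the seekable format so byte-offset seeks keep working.
        zstd::stream::copy_encode(File::open(path)?, File::create(&tmp)?, 3)?;
        // Replace the plain segment in place, keeping the file name, so readers
        // only have to look at the magic bytes to tell the formats apart.
        fs::rename(&tmp, path)?;
    }
    Ok(())
}

/// On replay, wrap compressed segments in a streaming decoder so the
/// uncompressed bytes only ever exist in memory, never on disk.
fn open_segment_for_replay(path: &Path) -> io::Result<Box<dyn Read>> {
    let reader: Box<dyn Read> = match segment_encoding(path)? {
        SegmentEncoding::Zstd => Box::new(zstd::stream::read::Decoder::new(File::open(path)?)?),
        SegmentEncoding::Plain => Box::new(File::open(path)?),
    };
    Ok(reader)
}
```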
> Test that traversing a compressed segment from an offset that is not the start of the segment is not catastrophically slower (e.g. by having to decompress and traverse from the start of the segment, instead of seeking using the offset index).
I contend that we don't actually care, as long as replaying from the most recent snapshot is still fast. I am not aware of any other performance-constrained case in which we traverse commitlog segments.
@gefjon replication will need to be able to randomly seek in segments, at least back to the latest snapshot.
> at least back to the latest snapshot.
Ack. This is a significantly weaker constraint than the one you wrote originally. E.g. I believe we would accept a solution where replaying from or seeking within a compressed segment was slow, but where the segment(s) after the most recent snapshot were kept uncompressed and were therefore fast.
Left the server running overnight at 60Hz (scheduled_at = new ScheduleAt.Interval(TimeSpan.FromTicks(TimeSpan.TicksPerSecond / 60))) with a single object with two components. Before:
$ select * from transforms
id | gameObjectId | pos | sequenceNumber
------+--------------+----------------------------------+----------------
4136 | 4136 | (x = -119.70513, y = -324.86374) | 401
$ select * from rigidbodies
id | gameObjectId | velocity | acceleration | mass | sequenceNumber
------+--------------+-----------------------------------+----------------+------+----------------
4136 | 4136 | (x = -3.4114923, y = -0.13953304) | (x = 0, y = 0) | 10 | 401
After:
$ select * from transforms
id | gameObjectId | pos | sequenceNumber
------+--------------+---------------------------------+----------------
4136 | 4136 | (x = -64775.79, y = -2972.1108) | 401
And I was impressed by the stability; it still seems to work perfectly fine. But then I noticed these messages:
2025-03-15T12:04:41.754998Z INFO crates/core/src/db/datastore/locking_tx_datastore/datastore.rs:263: Capturing snapshot of database Identity(xxx) at TX offset 4000000
2025-03-15T12:04:41.760369Z INFO /home/ubuntu/actions-runner-linux-x64-2.309.0/_work/SpacetimeDB/SpacetimeDB/crates/snapshot/src/lib.rs:561: [xxx] SNAPSHOT 00000000000004000000: Hardlinked 9 objects and wrote 9 objects
...it made me wonder about the disk usage. And oh boy, how is it possible that the data required for a single row in a couple of tables can lead to this? I was expecting kilobytes at most, maybe even just bytes, as I have no heavy indexing in use yet either.
$ du .local/share/spacetime/data/* -shc
120M .local/share/spacetime/data/cache
4.0K .local/share/spacetime/data/config.toml
2.6M .local/share/spacetime/data/control-db
424K .local/share/spacetime/data/logs
4.0K .local/share/spacetime/data/metadata.toml
34M .local/share/spacetime/data/program-bytes
1.2G .local/share/spacetime/data/replicas
4.0K .local/share/spacetime/data/spacetime.pid
1.4G total
$ du .local/share/spacetime/data/replicas/* -shc
6.4M .local/share/spacetime/data/replicas/2000001
6.4M .local/share/spacetime/data/replicas/2000003
6.4M .local/share/spacetime/data/replicas/2000005
7.8M .local/share/spacetime/data/replicas/2000007
7.7M .local/share/spacetime/data/replicas/4000001
6.4M .local/share/spacetime/data/replicas/4000003
66M .local/share/spacetime/data/replicas/4000005
118M .local/share/spacetime/data/replicas/4000007
8.4M .local/share/spacetime/data/replicas/4000009
138M .local/share/spacetime/data/replicas/4000011
19M .local/share/spacetime/data/replicas/4000013
44M .local/share/spacetime/data/replicas/4000015
766M .local/share/spacetime/data/replicas/4000017
1.2G total
$ du -shc .local/share/spacetime/data/replicas/4000017/*
761M .local/share/spacetime/data/replicas/4000017/clog
0 .local/share/spacetime/data/replicas/4000017/db.lock
36K .local/share/spacetime/data/replicas/4000017/module_logs
5.3M .local/share/spacetime/data/replicas/4000017/snapshots
766M total
A 1 GB overhead can be fine if, e.g., it means it will stay around that 1 GB for a long time, but I have a feeling that if there had been 1000 objects running overnight, I would have come back to GRUB and spent the morning hacking together a bootable ISO to unfuck my volume 😆
On another but likely related note, what is the expected startup time for, e.g., that gigabyte of replicas? It took minutes for what is outlined above, and the user experience left something to be desired, as the service seemed to be "up and ready" but wasn't:
# startup, seems up and ready to go
2025-03-15T12:32:17.868092Z DEBUG /home/ubuntu/actions-runner-linux-x64-2.309.0/_work/SpacetimeDB/SpacetimeDB/crates/standalone/src/subcommands/start.rs:145: Starting SpacetimeDB listening on 127.0.0.1:3000
# $ spacetime sql ...
2025-03-15T12:33:02.427096Z DEBUG /home/ubuntu/actions-runner-linux-x64-2.309.0/_work/SpacetimeDB/SpacetimeDB/crates/client-api/src/routes/database.rs:382: auth: AuthCtx { owner: Identity(xxx), caller: Identity(xxx) }
# finally something is happening 2 minutes after it reported it was listening (and it was but not responding)!
2025-03-15T12:34:35.696716Z INFO /home/ubuntu/actions-runner-linux-x64-2.309.0/_work/SpacetimeDB/SpacetimeDB/crates/core/src/db/relational_db.rs:312: [xxx] DATABASE: durable_tx_offset is Some(4082620)
# done loading (seems ~instant)
2025-03-15T12:34:35.972922Z INFO /home/ubuntu/actions-runner-linux-x64-2.309.0/_work/SpacetimeDB/SpacetimeDB/crates/core/src/db/relational_db.rs:1249: [xxx] DATABASE: rebuilt state after replay
# actually up and ready
2025-03-15T12:34:36.093491Z DEBUG /home/ubuntu/actions-runner-linux-x64-2.309.0/_work/SpacetimeDB/SpacetimeDB/crates/client-api/src/routes/subscribe.rs:144: New client connected from unknown ip
I suggest editing the initial message and perhaps adding some extra info, such as:
Starting SpacetimeDB listener on 127.0.0.1:3000 (NOT READY)
Initializing database (NOT READY)
<tell me what's happening with these huge tasks between messages>
[xxx] DATABASE: durable_tx_offset is Some(4082620)
...
[xxx] DATABASE: rebuilt state after replay
Database is now ready for connections (READY)
for improved new developer experience :)
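A rough sketch of where such messages could hook in, assuming the tracing macros already visible in the logs above; the function names below are placeholders, not the actual startup code:

```rust
use tracing::info;

// Placeholder startup sequence; the real code paths live in
// crates/standalone/src/subcommands/start.rs and the core crate.
async fn start(addr: std::net::SocketAddr) -> anyhow::Result<()> {
    info!("Starting SpacetimeDB listener on {} (NOT READY)", addr);
    info!("Initializing database (NOT READY)");
    // This is where the multi-minute gap above is spent: replaying the
    // commitlog since the most recent snapshot. Periodic progress output
    // (e.g. commits replayed so far) would make the wait legible.
    replay_from_latest_snapshot().await?;
    info!("Database is now ready for connections (READY)");
    serve(addr).await
}

// Stubs standing in for the actual replay and serving logic.
async fn replay_from_latest_snapshot() -> anyhow::Result<()> { Ok(()) }
async fn serve(_addr: std::net::SocketAddr) -> anyhow::Result<()> { Ok(()) }
```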
Updated plan for initial version:
- When taking a snapshot, compress all commitlog segments older than the one containing the snapshotted TX
- Keep the segment(s) needed to replay from the most recent snapshot uncompressed
- Leave snapshots uncompressed
- Use `zstd --format=seekable` (replication requires seeking by tx offset)