
Add a SQLite backed storage implementation

Open magnetised opened this issue 1 year ago • 3 comments

Alternative to the hybrid file + cubdb storage implementation.

Currently it's implemented with both snapshots and the log written to sqlite, but we could always go back to the hybrid model if this turns out to be an issue.

To improve test reliability I upped some waits in the typescript client test. Afaict this is not because of slowness in the sqlite implementation, because the same tests were failing with the old file storage backend.

Naive benchmarking puts appending to the log at ~10-100x faster than the cubdb version.
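For context, a naive per-append measurement could look something like the sketch below. The schema is illustrative, not the one from this PR, and the numbers you get depend heavily on the journal mode, sync settings, and disk:

```python
import sqlite3
import time

conn = sqlite3.connect("bench.db")
conn.execute("PRAGMA journal_mode=WAL")    # timings depend heavily on
conn.execute("PRAGMA synchronous=NORMAL")  # these pragmas and the disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS log (log_offset INTEGER PRIMARY KEY, entry TEXT)"
)
conn.execute("DELETE FROM log")  # start from an empty log each run

N = 10_000
start = time.perf_counter()
for i in range(N):
    conn.execute("INSERT INTO log (log_offset, entry) VALUES (?, ?)", (i, "payload"))
    conn.commit()  # one commit per append, as a real log write would do
elapsed = time.perf_counter() - start
print(f"~{elapsed / N * 1e6:.0f} microseconds per append")
```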

magnetised avatar Sep 23 '24 10:09 magnetised

hey, this looks nice! Would love to see the performance of this with our benchmarks. Particularly interested in knowing how it handles lots of shapes and lots of writes

I have two questions:

  • Do you handle log compaction, or have an idea on how we'd implement it?
  • The same questions for chunking. Is that just a range query :D?

balegas avatar Sep 23 '24 11:09 balegas

hey, this looks nice! Would love to see the performance of this with our benchmarks. Particularly interested in knowing how it handles lots of shapes and lots of writes

It's basically a dupe of the cubdb storage, but with sqlite as the db and the pure file-dump snapshot replaced by a table in sqlite. Each shape has its own db, so the characteristics are basically the same as for cubdb but with faster writes (~10-100 microseconds for a log append, as opposed to ~4ms). A rough sketch of the layout is below.
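A guess at what that per-shape database could look like. All table and column names here are illustrative, not the actual schema from this PR:

```python
import sqlite3

# One SQLite file per shape, holding both the snapshot and the log.
conn = sqlite3.connect("shape.db")
conn.executescript("""
    -- replaces the old pure file-dump snapshot
    CREATE TABLE IF NOT EXISTS snapshot (
        pk    TEXT PRIMARY KEY,
        value TEXT NOT NULL
    );
    -- append-only shape log, keyed by log offset
    CREATE TABLE IF NOT EXISTS log (
        log_offset INTEGER PRIMARY KEY,
        pk         TEXT NOT NULL,
        entry      TEXT NOT NULL
    );
""")

def append(log_offset: int, pk: str, entry: str) -> None:
    # The fast path measured above: a single keyed insert per log append.
    conn.execute(
        "INSERT INTO log (log_offset, pk, entry) VALUES (?, ?, ?)",
        (log_offset, pk, entry),
    )
    conn.commit()
```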

* Do you handle log compaction, or have an idea on how we'd implement it?

No, I haven't touched it or thought about it yet. We could do it by, say, replacing entries within a range in the db. With sqlite we have the advantage of having multiple tables within one file, so there may be some kind of dupe table, compact, then rename-or-overwrite-table dance, all within isolated snapshots. We could also use multiple concurrent tables and have some pointer? Dunno, lots of options though; one possible version of the table-swap dance is sketched below.
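One possible shape of that dance, against the hypothetical log table sketched above: keep only the latest entry per pk, then rename the compacted copy over the old table.

```python
import sqlite3

def compact(conn: sqlite3.Connection) -> None:
    # Sketch of a "dupe table, compact, swap" dance. Note that
    # CREATE TABLE ... AS does not carry over the primary key, so a
    # real version would recreate constraints and indexes too.
    conn.execute("BEGIN")
    conn.execute("""
        CREATE TABLE log_compacted AS
            SELECT log_offset, pk, entry
            FROM log
            WHERE log_offset IN (SELECT MAX(log_offset) FROM log GROUP BY pk)
    """)
    conn.execute("ALTER TABLE log RENAME TO log_old")
    conn.execute("ALTER TABLE log_compacted RENAME TO log")
    conn.execute("DROP TABLE log_old")
    conn.commit()
```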

* The same questions for chunking. Is that just a range query :D?

I've replicated the cubdb version, with chunk log offsets in a separate table and then the same log stream offset logic; see the sketch below.
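Against the same illustrative tables as above, serving a chunk really is just a range query over the log between one stored boundary offset and the next. A hypothetical `chunk_boundaries` table is assumed here:

```python
import sqlite3

def read_chunk(conn: sqlite3.Connection, chunk_start: int):
    # Assumes a chunk_boundaries(chunk_start INTEGER PRIMARY KEY) table
    # recording the log_offset at which each chunk begins.
    (next_start,) = conn.execute(
        "SELECT MIN(chunk_start) FROM chunk_boundaries WHERE chunk_start > ?",
        (chunk_start,),
    ).fetchone()
    if next_start is None:  # last (still open) chunk: read to the end
        return conn.execute(
            "SELECT log_offset, entry FROM log WHERE log_offset >= ? "
            "ORDER BY log_offset",
            (chunk_start,),
        ).fetchall()
    return conn.execute(
        "SELECT log_offset, entry FROM log "
        "WHERE log_offset >= ? AND log_offset < ? ORDER BY log_offset",
        (chunk_start, next_start),
    ).fetchall()
```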

magnetised avatar Sep 23 '24 11:09 magnetised

@balegas

* Do you handle log compaction, or have an idea on how we'd implement it?

@marc-shapiro, @preguica, and I discussed the format of the shape log and log compaction during last week's on-site. Currently, we store the shape as a sequence of changes, where every change has an offset. After compaction, every PK will have a single entry in the log, with the offset of the latest write (insert/update/delete) to that PK. Storing the shape log as a sequence of changes seems like a good format for files on disk.

However, we could use an alternative representation for the shape log that is more appropriate for storing in a DB: a table per shape that, for every PK in the shape, tracks its latest value and the "time"/"offset" at which that value was written. When a row is updated or deleted, we update its value and offset in that shape's table. When a user requests a shape starting at a given offset, we query the table for all rows with an offset greater than the provided offset.
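A minimal sketch of that representation, again with illustrative names (requires SQLite 3.24+ for the upsert syntax):

```python
import sqlite3

conn = sqlite3.connect("shape.db")
conn.executescript("""
    -- one row per PK in the shape: its latest value and the offset of
    -- the write that produced it
    CREATE TABLE IF NOT EXISTS shape_rows (
        pk         TEXT PRIMARY KEY,
        value      TEXT,              -- NULL marks a delete tombstone
        log_offset INTEGER NOT NULL
    );
    CREATE INDEX IF NOT EXISTS shape_rows_by_offset ON shape_rows (log_offset);
""")

def apply_change(pk: str, value: str | None, log_offset: int) -> None:
    # Inserts, updates, and deletes all become the same upsert, so the
    # stored state is always the compacted log.
    conn.execute(
        """
        INSERT INTO shape_rows (pk, value, log_offset) VALUES (?, ?, ?)
        ON CONFLICT (pk) DO UPDATE
            SET value = excluded.value, log_offset = excluded.log_offset
        """,
        (pk, value, log_offset),
    )
    conn.commit()

def read_since(log_offset: int):
    # Serving a shape request from a given offset is a single range query.
    return conn.execute(
        "SELECT pk, value, log_offset FROM shape_rows WHERE log_offset > ? "
        "ORDER BY log_offset",
        (log_offset,),
    ).fetchall()
```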

The advantage of this approach is that we only ever store the compacted log (since updates and deletes effectively overwrite the entry for that row in the DB). Another advantage is that we don't have to implement and maintain our own file-based storage system, which would probably be quite complex to optimize properly.

kevin-dp avatar Sep 23 '24 13:09 kevin-dp

We've decided not to move forward with this approach while there are other performance gains to pursue besides the speed of the storage backend. I'm closing the pull request.

balegas avatar Oct 08 '24 16:10 balegas