seafowl
Content-addressable object IDs
Currently, when writing data:
- we create a region for every partition of the original plan (a row in the `physical_region` table in the database)
- the region has a unique ID (just a bigint)
- the region has an "object storage ID" (path to the physical file)
The ID of the region isn't content-addressable (it always increases), but the object storage ID is. This means that if we're about to write a Parquet file that already exists (same hash), we'll create a new row in the `physical_region` table (doesn't consume much space) and overwrite the same file in the object storage (doesn't consume extra space, but wastes time re-uploading the file).
https://github.com/splitgraph/seafowl/blob/f00efc451aaa80a818b42e5d0be72efe39f3f50c/src/context.rs#L340-L356
Figure out:
- if we want to have a separate "region ID" and "object storage ID"
- how to skip uploading regions that already exist
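
A minimal sketch of the second point, assuming the object store can answer a cheap existence check before an upload. Everything here is hypothetical and for illustration only: the `ObjectStore` trait, `InMemoryStore`, `Catalog` and `write_region` are not Seafowl's actual types, and the object storage ID is passed in as an already computed string. The sketch keeps the current scheme of a separate, always-increasing region ID, so it leaves the first question open.

```rust
use std::collections::HashMap;

/// Hypothetical, minimal stand-in for an object store (not Seafowl's real API).
trait ObjectStore {
    fn exists(&self, key: &str) -> bool;
    fn put(&mut self, key: &str, data: Vec<u8>);
}

/// In-memory store that counts uploads so that skipped uploads are observable.
#[derive(Default)]
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
    uploads: usize,
}

impl ObjectStore for InMemoryStore {
    fn exists(&self, key: &str) -> bool {
        self.objects.contains_key(key)
    }

    fn put(&mut self, key: &str, data: Vec<u8>) {
        self.uploads += 1;
        self.objects.insert(key.to_string(), data);
    }
}

/// Hypothetical stand-in for the physical_region table:
/// rows of (region ID, object storage ID).
#[derive(Default)]
struct Catalog {
    next_region_id: i64,
    regions: Vec<(i64, String)>,
}

impl Catalog {
    /// Always allocates a fresh, increasing region ID (the current behaviour).
    fn create_region(&mut self, object_storage_id: &str) -> i64 {
        self.next_region_id += 1;
        self.regions
            .push((self.next_region_id, object_storage_id.to_string()));
        self.next_region_id
    }
}

/// Uploads the file only if no object with the same content-addressable
/// ID exists yet, then records the region in the catalog.
fn write_region<S: ObjectStore>(
    store: &mut S,
    catalog: &mut Catalog,
    object_storage_id: &str,
    data: Vec<u8>,
) -> i64 {
    if !store.exists(object_storage_id) {
        store.put(object_storage_id, data);
    }
    catalog.create_region(object_storage_id)
}
```

Writing the same file twice through `write_region` would upload it once and still create two `physical_region` rows. The check and the upload are not atomic, but since the key is derived from the content, two writers racing on the same key would only overwrite identical bytes.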