Content-addressable object IDs

Open mildbyte opened this issue 2 years ago • 0 comments

Currently, when writing data:

  • we create a region for every partition of the original plan (a row in the physical_region table in the database)
  • the region has a unique ID (just a bigint)
  • the region has an "object storage ID" (path to the physical file)

The region ID isn't content-addressable (it always increases), but the object storage ID is. This means that if we're about to write a Parquet file that already exists (same hash), we still create a new row in the physical_region table (which doesn't consume much space) and overwrite the same file in the object storage (which doesn't consume extra space, but wastes time re-uploading the file).

https://github.com/splitgraph/seafowl/blob/f00efc451aaa80a818b42e5d0be72efe39f3f50c/src/context.rs#L340-L356

Figure out:

  • if we want to have a separate "region ID" and "object storage ID"
  • how to skip uploading regions that already exist
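
One possible way to skip the redundant upload is sketched below: derive the object storage ID from the file contents and check whether it already exists before writing. This is only an illustration, not the code in `context.rs`. `InMemoryStore`, `object_storage_id` and `put_if_absent` are hypothetical names, the store is an in-memory map standing in for real object storage, and `DefaultHasher` stands in for a proper cryptographic hash (e.g. SHA-256) just to keep the sketch free of dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory stand-in for the object store.
#[derive(Default)]
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
    /// Number of uploads actually performed (for illustration only).
    uploads: usize,
}

/// Derives a content-addressable ID from the file bytes.
/// ASSUMPTION: DefaultHasher is a placeholder; a real implementation
/// would use a cryptographic hash so that collisions are not a concern.
fn object_storage_id(data: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    format!("{:016x}.parquet", hasher.finish())
}

impl InMemoryStore {
    /// Uploads `data` only if no object with the same content-derived ID
    /// exists yet. Returns the ID and whether an upload took place.
    fn put_if_absent(&mut self, data: &[u8]) -> (String, bool) {
        let id = object_storage_id(data);
        if self.objects.contains_key(&id) {
            // Same hash: the object is already stored, skip the upload.
            return (id, false);
        }
        self.objects.insert(id.clone(), data.to_vec());
        self.uploads += 1;
        (id, true)
    }
}
```

With a real object store, the `contains_key` check would become an existence (HEAD-style) request, which costs a round trip but avoids re-sending the file body. The same content-derived ID could also be reused as the region's key, which is one way to answer the first question above.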

mildbyte, Jul 19 '22 12:07