More sophisticated ETag calculation

Open mildbyte opened this issue 2 years ago • 2 comments

Follow-up to https://github.com/splitgraph/seafowl/issues/20

Currently, we compute the ETag based on all versions of Seafowl tables in a query. This disregards:

- Contents changing when the version doesn't. For example, a dataset built by following https://www.splitgraph.com/docs/seafowl/guides/baking-dataset-docker-image will always use V1 for all tables, since we're rebuilding the dataset from scratch each time. Should we use the hashes of the Parquet files instead?
- UDF/built-in function definitions changing
- DataFusion query plan / execution changing

Aug 17 '22 11:08 mildbyte
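
As a rough illustration of the three gaps listed in the issue description, an ETag could be derived from the contents of the Parquet files, the function definitions, and an engine version string, rather than from table versions alone. This is a minimal sketch and not Seafowl's implementation: `file_sha256`, `compute_etag`, the `function_definitions` mapping, and the `engine_version` string are all hypothetical names, and the query text is assumed to be accounted for elsewhere.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in fixed-size reads so large files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def compute_etag(parquet_paths, function_definitions, engine_version: str) -> str:
    """Combine file content hashes, function definitions and an engine version.

    All three inputs are hypothetical stand-ins for the items listed in the issue.
    """
    digest = hashlib.sha256()
    # Sort the file hashes so the ETag does not depend on listing order.
    for file_hash in sorted(file_sha256(Path(p)) for p in parquet_paths):
        digest.update(file_hash.encode())
        digest.update(b"\0")
    # Include each function's name and body so a redefined UDF changes the ETag.
    for name in sorted(function_definitions):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(function_definitions[name].encode())
        digest.update(b"\0")
    # A version string stands in for "query plan / execution changing".
    digest.update(engine_version.encode())
    return '"' + digest.hexdigest() + '"'
```

With this shape, rebuilding a dataset from scratch yields the same ETag only if the Parquet bytes are identical, even when every table is still at V1.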

A prolly tree data structure could come in useful here https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-tree-structure.

If a query plan can be evaluated deterministically, each stage could hash the chunks it operates on and use those hashes to pull results from an internal cache.

Are you familiar with doltdb https://github.com/dolthub/dolt? Essentially I would love to apply those semantics to seafowl!!

Oct 26 '22 19:10 rupurt
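
The per-stage caching idea in the comment above can be sketched as a memo table keyed by the stage name and a hash of the input chunk. This is a simplification under stated assumptions: it uses fixed, caller-supplied chunks rather than the content-defined boundaries of a prolly tree, rows are assumed to be JSON-serialisable, and `content_hash` and `StageCache` are hypothetical names.

```python
import hashlib
import json


def content_hash(rows) -> str:
    """Hash a chunk of rows via a canonical JSON encoding."""
    encoded = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


class StageCache:
    """Memoise deterministic plan stages by (stage name, input chunk hash)."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def run(self, stage_name, stage_fn, chunk):
        key = (stage_name, content_hash(chunk))
        if key in self._results:
            # Same stage over identical input: reuse the earlier result.
            self.hits += 1
            return self._results[key]
        result = stage_fn(chunk)
        self._results[key] = result
        return result


cache = StageCache()
chunks = [[1, 2, 3], [4, 5, 6]]
first_totals = [cache.run("sum", sum, c) for c in chunks]

# Only the second chunk changes, so the first is served from the cache.
chunks[1] = [4, 5, 7]
second_totals = [cache.run("sum", sum, c) for c in chunks]
```

The cache is only sound if each stage is a pure function of its input chunk, which is the "evaluated deterministically" condition in the comment.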

Another method: a weak ETag with a timestamp baked in lets clients decide whether what they would fetch from caches is within an acceptable staleness threshold. If not, they issue a request that bypasses caches.

Oct 26 '22 21:10 ignoramous
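
The weak-ETag suggestion above could look roughly like this. It is a sketch only: the `W/"<hash>-<unix timestamp>"` layout is an assumed format, not one defined by Seafowl or by the HTTP specification (which treats the quoted part of an ETag as opaque), and `make_weak_etag`, `parse_weak_etag`, and `is_fresh_enough` are hypothetical helpers.

```python
import re
import time

# Assumed layout: W/"<hex content hash>-<unix timestamp in seconds>"
_WEAK_ETAG = re.compile(r'^W/"([0-9a-f]+)-(\d+)"$')


def make_weak_etag(content_hash: str, computed_at: int) -> str:
    """Build a weak ETag that carries the time the result was computed."""
    return f'W/"{content_hash}-{computed_at}"'


def parse_weak_etag(etag: str):
    """Return (content_hash, computed_at), or None if the layout does not match."""
    match = _WEAK_ETAG.match(etag)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_fresh_enough(etag: str, max_age_seconds: int, now=None) -> bool:
    """Client-side check: is the cached result within the acceptable threshold?"""
    parsed = parse_weak_etag(etag)
    if parsed is None:
        # Unknown format: treat as stale so the client refetches.
        return False
    _, computed_at = parsed
    current = time.time() if now is None else now
    return current - computed_at <= max_age_seconds
```

A client that finds the cached response too old would then reissue the request in a way that skips intermediate caches, for example by sending a `Cache-Control: no-cache` request header.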
