seafowl
More sophisticated ETag calculation
Follow-up to https://github.com/splitgraph/seafowl/issues/20
Currently, we compute the ETag based on all versions of Seafowl tables in a query. This disregards:
- Contents changing when the version doesn't. For example, a dataset built with https://www.splitgraph.com/docs/seafowl/guides/baking-dataset-docker-image will always use V1 for all tables, since we're rebuilding the dataset from scratch each time. Should we use the hashes of the Parquet files instead?
- UDF/built-in function definitions changing
- DataFusion query plan / execution changing
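To make the list above concrete, here is a minimal sketch of an ETag that folds in all three inputs. It is not Seafowl's current implementation: the function name `compute_etag`, its parameters, and the idea of using a single engine version string as a stand-in for "DataFusion query plan / execution changing" are assumptions made for illustration.

```python
import hashlib


def compute_etag(table_files, function_defs, engine_version):
    """Hypothetical ETag built from content hashes instead of table versions.

    table_files:    dict of table name -> list of Parquet file hashes (hex strings)
    function_defs:  dict of function name -> definition source text
    engine_version: string identifying the query engine build (assumed proxy
                    for changes in planning/execution behaviour)
    """
    h = hashlib.sha256()
    # Sort tables and files so the result doesn't depend on iteration order.
    for table in sorted(table_files):
        h.update(b"table:" + table.encode() + b"\n")
        for file_hash in sorted(table_files[table]):
            h.update(b"file:" + file_hash.encode() + b"\n")
    # Hash each function body so a redefinition changes the ETag.
    for name in sorted(function_defs):
        body_hash = hashlib.sha256(function_defs[name].encode()).hexdigest()
        h.update(b"fn:" + name.encode() + b"=" + body_hash.encode() + b"\n")
    h.update(b"engine:" + engine_version.encode() + b"\n")
    # Strong ETags are quoted strings.
    return '"' + h.hexdigest() + '"'
```

With this shape, a dataset rebuilt from scratch keeps the same ETag only if its Parquet files are byte-identical, even though every table is still at V1.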
A prolly tree data structure could be useful here: https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-tree-structure
If a query plan can be evaluated deterministically, each stage could hash the chunks it operates on and use those hashes to pull results from an internal cache instead of recomputing them.
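A rough sketch of what that could look like, assuming each stage is deterministic. This is a plain content-addressed cache, not a prolly tree, and `StageCache` and the operation strings are hypothetical names, not anything in Seafowl or DataFusion.

```python
import hashlib


class StageCache:
    """Hypothetical cache keyed by (operation, hashes of the stage's inputs)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(op_description, input_hashes):
        h = hashlib.sha256(op_description.encode())
        for input_hash in input_hashes:
            h.update(b"\0" + input_hash.encode())
        return h.hexdigest()

    def run(self, op_description, input_hashes, compute):
        """Return (key, result); only call compute() on a cache miss."""
        k = self.key(op_description, input_hashes)
        if k in self._store:
            self.hits += 1
        else:
            self._store[k] = compute()
        # The key doubles as the identity of this stage's output, so the
        # next stage can use it as one of its input hashes.
        return k, self._store[k]


cache = StageCache()
chunk = [3, 1, 2]
chunk_hash = hashlib.sha256(repr(chunk).encode()).hexdigest()

k1, filtered = cache.run("filter x > 1", [chunk_hash],
                         lambda: [x for x in chunk if x > 1])
k2, total = cache.run("sum", [k1], lambda: sum(filtered))
```

Running the same two stages again over an unchanged chunk hits the cache both times; changing the chunk changes `chunk_hash`, which invalidates every stage downstream of it.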
Are you familiar with Dolt (https://github.com/dolthub/dolt)? Essentially, I would love to apply those semantics to Seafowl!
Another method: a weak ETag with a timestamp baked in lets clients decide whether the data they might fetch from caches is within an acceptable age threshold. If it is not, they issue a request that bypasses caches.
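A minimal sketch of the client-side decision, assuming a hypothetical tag layout of `W/"<version>-<unix timestamp>"`. The helper names and the layout are illustrative only; `Cache-Control: no-cache` is the standard request directive that makes caches revalidate with the origin before reusing a stored response.

```python
import time


def make_weak_etag(version_tag, computed_at):
    """Build a weak ETag carrying the time the result was computed."""
    return f'W/"{version_tag}-{int(computed_at)}"'


def parse_weak_etag(etag):
    """Split a weak ETag of the assumed layout into (version_tag, timestamp)."""
    if not (etag.startswith('W/"') and etag.endswith('"')):
        raise ValueError(f"not a weak ETag: {etag!r}")
    version_tag, _, ts = etag[3:-1].rpartition("-")
    return version_tag, int(ts)


def is_fresh_enough(etag, max_age_seconds, now=None):
    """True if the result behind this ETag is within the client's threshold."""
    now = time.time() if now is None else now
    _, computed_at = parse_weak_etag(etag)
    return now - computed_at <= max_age_seconds


def request_headers(etag, max_age_seconds, now=None):
    """Headers for the follow-up request: force revalidation if too old."""
    if is_fresh_enough(etag, max_age_seconds, now):
        return {}
    return {"Cache-Control": "no-cache"}
```

The threshold lives entirely on the client, so different consumers of the same cached response can apply different staleness limits.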