feat: Use object store and async, byte-range reads
Summary
Currently, compute uses the S3 client to retrieve Parquet files before reading them. We have started a transition to using https://docs.rs/object_store/latest/object_store/ which supports (a) reading from multiple object stores and (b) doing a direct byte-range read without fetching the file locally first.
We should finish up this migration to fully benefit from object_store.
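To illustrate what this buys us, here is a minimal sketch (not Kaskada's code; `read_footer_bytes` is a hypothetical helper, and it assumes an `object_store` version that provides `parse_url`) of resolving a store from a URL and reading only a byte range:

```rust
use object_store::{parse_url, ObjectStore};
use url::Url;

/// Read only the last 8 bytes of an object (e.g. the Parquet footer length +
/// magic) with a byte-range GET. Only these bytes cross the network; the file
/// is never copied to local disk.
async fn read_footer_bytes(url: &Url) -> object_store::Result<Vec<u8>> {
    // parse_url picks the backend (s3://, gs://, file://, ...) from the scheme.
    let (store, path) = parse_url(url)?;
    let size = store.head(&path).await?.size;
    let tail = store.get_range(&path, size - 8..size).await?;
    Ok(tail.to_vec())
}
```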
- [ ] Call `abort_multipart` to clean up after failures
- [ ] #505
- [ ] Cleanup: Pass object store URLs as `ObjectStoreUrl` rather than `&str` or `String` (see the sketch after this list)
- [ ] Consider how we keep the object stores (and credentials) separate in the multi-tenant case.
- [x] For reading files during compute (#471)
- [x] For writing files during prepare (#475)
- [x] For writing metadata files during prepare (#475)
- [x] For reading metadata files during compute (#476)
- [x] For determining file schemas (#479)
- [x] For writing CSV files during compute (moved to #486)
- [x] For writing Parquet files during compute (#492)
- [x] For reading files during prepare (rather than copying to disk) (#495)
- [x] Make the `key` method and `ObjectStoreCrate` private (#501)
- [ ] For reading (or at least fetching) the incremental checkpoint (rocksdb) (#503)
- [ ] For writing (or at least uploading) the incremental checkpoint (rocksdb) (#503)
- [ ] For uploading the plan yaml and flight records (#503)
- [ ] Remove s3 helper and s3 crates (#503)
- [ ] Delete `ConvertURI` methods (https://github.com/kaskada-ai/kaskada/blob/main/wren/compute/helpers.go#L19-L23) (#503)
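For the `ObjectStoreUrl` cleanup item above, this is a hypothetical sketch of what such a newtype could look like; the actual type in this repo may differ, and the `parse`/`key`/`path` methods shown are assumptions:

```rust
use object_store::path::Path;
use url::Url;

/// Hypothetical newtype wrapping a parsed object store URL, so callers pass a
/// validated `ObjectStoreUrl` instead of a raw `&str`/`String`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ObjectStoreUrl(Url);

impl ObjectStoreUrl {
    pub fn parse(s: &str) -> Result<Self, url::ParseError> {
        Ok(Self(Url::parse(s)?))
    }

    /// Key identifying the store itself (scheme + host), e.g. "s3://my-bucket".
    /// Useful for caching one client per store; per #501 this stays crate-private.
    pub(crate) fn key(&self) -> String {
        format!("{}://{}", self.0.scheme(), self.0.host_str().unwrap_or_default())
    }

    /// The path of the object within the store.
    pub fn path(&self) -> Path {
        Path::from(self.0.path())
    }
}
```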
Some of this may be done as part of building the new partitioned execution logic (as part of #409).
I believe this work is complete for the `getMetadata()` and `prepareData()` methods, but it still needs to be completed on the query execution and materialization code paths.
In general -- I started working on this to allow operating on many and/or large files without filling up the disk. The first PR(s) are ready for review.
@epinzur re `getMetadata()` and `prepareData()` -- it isn't really complete for them either. Specifically, they still rely on downloading the whole file. For get metadata, we should only need to fetch the bytes corresponding to the footer; for prepare, we should be able to use object store to read bytes in chunks, never fetching the whole thing. Similarly, it isn't used for uploading the file yet.
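To make the footer-only read concrete, here is a rough sketch (not the repo's code) using `object_store` range reads plus the `parquet` crate's footer helpers; exact helper names and integer widths (`usize` vs `u64`) vary between crate versions:

```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectStore};
use parquet::file::footer::{decode_footer, decode_metadata};
use parquet::file::metadata::ParquetMetaData;

/// Fetch Parquet metadata with two range reads: the 8-byte footer, then the
/// metadata block it points at. The data pages are never downloaded.
async fn fetch_parquet_metadata(
    store: Arc<dyn ObjectStore>,
    path: &Path,
) -> anyhow::Result<ParquetMetaData> {
    let size = store.head(path).await?.size;

    // Last 8 bytes: 4-byte little-endian metadata length + "PAR1" magic.
    let tail = store.get_range(path, size - 8..size).await?;
    let metadata_len = decode_footer(tail.as_ref().try_into()?)?;

    // Fetch just the metadata bytes that precede the footer.
    let metadata_bytes = store
        .get_range(path, size - 8 - metadata_len..size - 8)
        .await?;

    // Decode into ParquetMetaData (schema, row groups, column statistics, ...).
    Ok(decode_metadata(&metadata_bytes)?)
}
```

Prepare could build on the same idea by streaming row groups through the parquet async reader rather than copying the whole file to disk first.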
Capturing some links / thoughts:
- Example of getting the minimum/maximum time from the parquet metadata (the file stats): https://github.com/kaskada-ai/kaskada/blob/7858a62bc26c4ffd2451336d6d4dee82bd393fab/crates/sparrow-runtime/src/metadata/prepared_metadata.rs#L45 (see also the sketch at the end of this comment)
- Fetching the schema is likely done by https://github.com/kaskada-ai/kaskada/blob/main/crates/sparrow-runtime/src/metadata/raw_metadata.rs
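For reference, pulling min/max values out of the row-group statistics looks roughly like this -- a sketch assuming the time column index is known and carries Int64 statistics (statistics accessor names have shifted across parquet versions, and the linked prepared_metadata.rs is the authoritative version):

```rust
use parquet::file::metadata::ParquetMetaData;
use parquet::file::statistics::Statistics;

/// Min/max of the (assumed) time column across all row groups, taken from the
/// column-chunk statistics in the footer. Returns None if any statistics are
/// missing, since we then cannot bound the file's time range.
fn min_max_time(metadata: &ParquetMetaData, time_column: usize) -> Option<(i64, i64)> {
    let mut min: Option<i64> = None;
    let mut max: Option<i64> = None;
    for row_group in metadata.row_groups() {
        match row_group.column(time_column).statistics() {
            Some(Statistics::Int64(stats)) if stats.has_min_max_set() => {
                let (lo, hi) = (*stats.min(), *stats.max());
                min = Some(min.map_or(lo, |m| m.min(lo)));
                max = Some(max.map_or(hi, |m| m.max(hi)));
            }
            _ => return None,
        }
    }
    min.zip(max)
}
```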