feat: allow preparing & outputting to an object store when running locally in a notebook
Summary

If I'm running the Kaskada engine locally in a notebook and working with larger datasets, it would be better if I could use remote object storage for the prepare cache and query results output. That way I wouldn't need to worry about filling up my local disk.
Is your feature request related to a problem? Please describe.

This is related to trying to compute on large datasets (1+ TB) when the available local storage for my notebook is smaller than the data size. This could happen when working on a local machine or from a hosted platform like Google Colab; the default local disk size for Google Colab is 80 GB.
Describe the solution you'd like

The manager and engine already support using remote object storage for the prepare cache and query output storage. The Python client should be updated to allow creating a local session with the following environment variables specified on manager startup (a sketch follows the list):
- OBJECT_STORE_TYPE: either s3 or gs
- OBJECT_STORE_BUCKET: the name of the bucket
- OBJECT_STORE_PATH: the path in the bucket to store all data in
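As a rough illustration, here is a minimal sketch of what that could look like from a notebook. It assumes the client's LocalBuilder session entry point and that the manager reads these variables from its environment at startup; the bucket and path values are placeholders.

```python
import os

from kaskada.api.session import LocalBuilder

# Configure the object store before the local manager starts, so it
# picks these up from the environment (values below are placeholders).
os.environ["OBJECT_STORE_TYPE"] = "s3"                      # or "gs"
os.environ["OBJECT_STORE_BUCKET"] = "my-kaskada-bucket"
os.environ["OBJECT_STORE_PATH"] = "notebooks/prepare-cache"

session = LocalBuilder().build()
```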
Describe alternatives you've considered
- Google Colab offers 400 GB of local disk for an additional fee, but this is still too small for my use case.
- I could slice the data down to 1%, but I'd like to eventually compute over the full dataset.
Two thoughts:
- I could see a case where the user wants to separate prepare and output, since they have different roles. Ideally these could be separate options.
- Why is it three environment variables? It seems like we've started moving towards specifying s3://<bucket>/<path> as a single option, allowing the user to provide all three with a single URL. E.g., prepare_prefix_url='s3://<bucket>/prepare/' and output_prefix_url='file:///tmp/path/to/local/output' could be used to prepare to S3 but output locally to the /tmp directory (see the sketch below).
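To show how a single prefix URL can carry all three settings, here is a hypothetical helper; the function name and return shape are made up for illustration and are not part of the client:

```python
from urllib.parse import urlparse

def split_prefix_url(url: str) -> dict:
    """Split a prefix URL like 's3://bucket/prepare/' into the three
    existing environment variables. Illustrative only."""
    parsed = urlparse(url)
    if parsed.scheme not in ("s3", "gs"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return {
        "OBJECT_STORE_TYPE": parsed.scheme,
        "OBJECT_STORE_BUCKET": parsed.netloc,
        "OBJECT_STORE_PATH": parsed.path.lstrip("/"),
    }

# split_prefix_url("s3://my-bucket/prepare/") returns:
# {'OBJECT_STORE_TYPE': 's3',
#  'OBJECT_STORE_BUCKET': 'my-bucket',
#  'OBJECT_STORE_PATH': 'prepare/'}
```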
We'll also probably eventually want an option for controlling where the snapshots (RocksDB) are written.
Currently there is just "storage owned by Kaskada"; this includes the prepare cache, RocksDB snapshots, output files, compute traces, etc.
I think separating into multiple storage locations should be a separate issue.
I think it depends on whether it is possible for the user to specify this today (e.g., by passing extra arguments to Wren via the session builder). If that's the case, then we may want to defer making any API changes until we have a plan for what the API should be, and treat the extra arguments as a way to accomplish this in the meantime. If API changes are necessary, then we should discuss further.
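For reference, the "no API changes" path might look something like the following sketch. Whether the builder can forward environment variables to the Wren process, and the with_manager_env name shown here, are both assumptions for illustration, not the client's actual API:

```python
from kaskada.api.session import LocalBuilder

# Hypothetical: a builder hook that forwards extra environment variables
# to the Wren (manager) process at startup. If such a hook exists, no new
# public API would be needed for object-store configuration yet.
session = (
    LocalBuilder()
    .with_manager_env({
        "OBJECT_STORE_TYPE": "gs",
        "OBJECT_STORE_BUCKET": "my-kaskada-bucket",
        "OBJECT_STORE_PATH": "notebooks/output",
    })
    .build()
)
```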