polars
Use a manager/proxy interface to access temporary storage (temporary directory)
Description
We currently access temporary storage directly through its paths on disk. We want to introduce an interface that makes it easier to:
- Perform file cleanups in a structured manner.
- Use temporary storage as a local cache for downloaded files/datasets.
  - Need to figure out how to handle invalidation (i.e. remote file was updated)
    - Use cloud file metadata field?
    - Maybe have a config to fully invalidate all caches, e.g. POLARS_INVALIDATE_CACHES
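As a rough sketch of the invalidation idea above, the following assumes the remote store reports some version marker (for example an ETag) that can be saved next to the cached file. The `CacheEntryMeta` struct, its `remote_version` field, and the convention that `POLARS_INVALIDATE_CACHES=1` means "invalidate everything" are all hypothetical and not taken from the polars codebase.

```rust
use std::env;

/// Hypothetical metadata saved next to a cached file.
#[derive(Debug, Clone, PartialEq)]
struct CacheEntryMeta {
    /// Version marker reported by the remote store (e.g. an ETag). Assumed field.
    remote_version: String,
}

/// Returns true when the cached copy may be reused: the global invalidation
/// flag is off and the remote version marker has not changed.
fn cache_is_valid(cached: &CacheEntryMeta, remote_version: &str, invalidate_all: bool) -> bool {
    !invalidate_all && cached.remote_version == remote_version
}

/// Reads the POLARS_INVALIDATE_CACHES flag from the environment.
/// Treating the value "1" as "enabled" is an assumption of this sketch.
fn invalidate_all_from_env() -> bool {
    env::var("POLARS_INVALIDATE_CACHES")
        .map(|v| v == "1")
        .unwrap_or(false)
}
```

A caller would fetch the current remote marker, then call `cache_is_valid(&meta, &marker, invalidate_all_from_env())` before deciding whether to download again.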
Usage scenarios
There are a few different ways we might use this:
- Caching downloaded cloud files
- Spilling from operators (e.g. from out-of-core group-by)
This might run on our tokio runtime. We could then spawn a static task (one that runs for the duration of the polars process) that sleeps most of the time and garbage collects once in a while.
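To illustrate the "sleep most of the time, occasionally garbage collect" shape, here is a minimal sketch that uses a plain `std::thread` as a stand-in for a tokio task, so that it stays self-contained. The `spawn_gc_task` name, the stop flag, and the `gc` callback are hypothetical; a real version would presumably be spawned on the existing runtime instead.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns a background thread that calls `gc`, sleeps for `interval`,
/// and repeats until `stop` is set to true.
fn spawn_gc_task<F>(interval: Duration, stop: Arc<AtomicBool>, gc: F) -> thread::JoinHandle<()>
where
    F: Fn() + Send + 'static,
{
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            gc();
            thread::sleep(interval);
        }
    })
}
```

The stop flag is only there so the sketch can be shut down cleanly. A task meant to live for the whole process could simply loop forever.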
Alright, did a brainstorm. I think we have got some ideas.
Assume our spill/cache directory is ~/.polars/.
We can dump spilled files under a folder named after a combination of the process id and the current datetime. This folder can also hold future spill files.
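One way the folder name could be built is sketched below. The `<pid>_<unix seconds>` format and both function names are assumptions made for illustration; the thread only says "process id and current datetime" without fixing a format.

```rust
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Builds a folder name of the assumed form `<pid>_<unix seconds>`.
fn spill_dir_name(pid: u32, unix_secs: u64) -> String {
    format!("{pid}_{unix_secs}")
}

/// Returns the spill folder path for the current process under `root`.
/// Falls back to 0 seconds if the system clock is before the Unix epoch.
fn spill_dir_for_current_process(root: &Path) -> PathBuf {
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    root.join(spill_dir_name(std::process::id(), secs))
}
```

Including the timestamp alongside the pid helps distinguish a folder left by an old process from one belonging to a new process that happens to reuse the same pid.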
For cached files we should provide a time-to-live (TTL). This TTL could, for instance, be 1 day for files downloaded from the internet.
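A TTL check could look roughly like the sketch below. It compares a file's last-modified time against a caller-supplied "now", which keeps the function easy to test. The `is_expired` name and the choice to treat timestamps in the future as not expired are assumptions.

```rust
use std::time::{Duration, SystemTime};

/// Returns true when a file last modified at `last_modified` is older than `ttl`
/// as seen from `now`.
fn is_expired(last_modified: SystemTime, ttl: Duration, now: SystemTime) -> bool {
    match now.duration_since(last_modified) {
        Ok(age) => age > ttl,
        // The modification time lies in the future (e.g. clock skew);
        // this sketch conservatively keeps the file.
        Err(_) => false,
    }
}
```

In practice `last_modified` would come from `std::fs::metadata(path)?.modified()?` and `now` from `SystemTime::now()`.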
During startup we create a task that looks for old pid_datetime folders whose process is no longer alive (e.g. it was interrupted) and for cached files that have surpassed their TTL, and cleans them up.
~/.polars/
    # Spills from the streaming engine. For future reference
    pid_datetime/
    pid_datetime/
    # files with a TTL
    cache/
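The startup scan for orphaned spill folders might be sketched as follows. It assumes folder names start with `<pid>_`, and it takes the liveness check as a closure because detecting whether a pid is alive is platform-specific and not specified in this thread. Both function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Extracts the pid from a folder name of the assumed form `<pid>_<datetime>`.
fn pid_from_dir_name(name: &str) -> Option<u32> {
    let (pid, _rest) = name.split_once('_')?;
    pid.parse().ok()
}

/// Lists spill folders under `root` whose owning process is no longer alive.
/// Folders that do not match the naming scheme (such as `cache/`) are skipped.
fn stale_spill_dirs(root: &Path, is_alive: impl Fn(u32) -> bool) -> io::Result<Vec<PathBuf>> {
    let mut stale = Vec::new();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        if !entry.file_type()?.is_dir() {
            continue;
        }
        let name = entry.file_name();
        if let Some(pid) = name.to_str().and_then(pid_from_dir_name) {
            if !is_alive(pid) {
                stale.push(entry.path());
            }
        }
    }
    // Sort for a deterministic result; read_dir order is unspecified.
    stale.sort();
    Ok(stale)
}
```

The returned paths could then be passed to `fs::remove_dir_all`. Keeping the listing separate from the deletion makes it possible to log or dry-run the cleanup first.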
The spill manager can be a static struct that initially only deals with the downloads, caching and cleanup. I think that we should set an in-process bit during downloading so that we don't start duplicate downloads.
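One possible reading of the "in-process bit" is a process-wide set of keys for downloads that are currently running. The sketch below uses a static `OnceLock<Mutex<HashSet<String>>>`; the names `IN_FLIGHT`, `try_begin_download`, and `finish_download` are hypothetical, and using the URL as the key is an assumption.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Process-wide set of downloads that are currently in progress.
static IN_FLIGHT: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

fn in_flight() -> &'static Mutex<HashSet<String>> {
    IN_FLIGHT.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Marks `url` as being downloaded. Returns true if the caller should start
/// the download, or false if another caller already started it.
fn try_begin_download(url: &str) -> bool {
    in_flight().lock().unwrap().insert(url.to_string())
}

/// Clears the in-progress mark once the download has finished or failed.
fn finish_download(url: &str) {
    in_flight().lock().unwrap().remove(url);
}
```

This only prevents duplicate downloads within a single process. A caller that gets `false` would still need some way to wait for the first download to complete, which this sketch leaves out.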
Would it make sense to automatically delete caches belonging to the current process on exit, even in the case of a crash? (Rather than waiting for the next time polars is run?)