
Use a manager/proxy interface to access temporary storage (temporary directory)

Open nameexhaustion opened this issue 1 year ago • 1 comments

Description

We currently use temporary storage directly through its paths on disk. We want to introduce an interface that makes it easier to:

  • Perform file cleanups in a structured manner.
  • Use temporary storage as a local cache for downloaded files/datasets.
    • Need to figure out how to handle invalidation (i.e. remote file was updated)
      • Use cloud file metadata field?
      • Maybe have a config to fully invalidate all caches e.g. POLARS_INVALIDATE_CACHES
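One possible shape for the invalidation check, as a minimal sketch only. The `CachedMeta` struct, the use of an ETag-style version tag plus file size as the "cloud file metadata field", and treating `POLARS_INVALIDATE_CACHES=1` as "on" are all assumptions made for illustration, not decided behavior.

```rust
use std::env;

/// Metadata recorded when a remote file was cached (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct CachedMeta {
    /// Version tag reported by the remote store, e.g. an ETag (assumption).
    pub etag: Option<String>,
    /// Size of the remote file in bytes.
    pub size: u64,
}

/// Returns true if the cached copy may be reused.
pub fn is_cache_valid(cached: &CachedMeta, remote: &CachedMeta, invalidate_all: bool) -> bool {
    if invalidate_all {
        // Global switch: never trust any cache entry.
        return false;
    }
    match (&cached.etag, &remote.etag) {
        (Some(a), Some(b)) => a == b && cached.size == remote.size,
        // Without a version tag on both sides freshness cannot be proven.
        _ => false,
    }
}

/// Reads the proposed POLARS_INVALIDATE_CACHES switch.
/// Treating the value "1" as "on" is an assumption of this sketch.
pub fn invalidate_all_from_env() -> bool {
    env::var("POLARS_INVALIDATE_CACHES")
        .map(|v| v == "1")
        .unwrap_or(false)
}
```

A caller would fetch the remote metadata first and only download again when `is_cache_valid(&cached, &remote, invalidate_all_from_env())` returns false.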

Usage scenarios

There are a few different ways we might use this:

  • Caching downloaded cloud files
  • Spilling from operators (e.g. from out-of-core group-by)

nameexhaustion avatar May 15 '24 08:05 nameexhaustion

This might run on our tokio runtime. Then we could have a static task (one that runs for the duration of the polars process) that sleeps most of the time and garbage collects once in a while.
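A rough sketch of such a sleep-then-collect loop. To stay dependency-free it uses a plain `std::thread` instead of a tokio task, which is a substitution made only for illustration. `spawn_gc_task` and the `collect` callback are hypothetical names.

```rust
use std::thread;
use std::time::Duration;

/// Spawns a background thread that sleeps for `interval` and then calls
/// `collect`, repeating until the process exits.
pub fn spawn_gc_task<F>(interval: Duration, collect: F) -> thread::JoinHandle<()>
where
    F: Fn() + Send + 'static,
{
    thread::spawn(move || loop {
        // Most of the time the task is asleep.
        thread::sleep(interval);
        // Once in a while it garbage collects.
        collect();
    })
}
```

On a tokio runtime the same loop would presumably be an async task that awaits a timer instead of blocking a thread.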

ritchie46 avatar May 15 '24 09:05 ritchie46

Alright, did a brainstorm. I think we have got some ideas.

Assuming our spill/cache directory is ~/.polars/.

We can dump spilled files under a folder whose name combines the process ID and the current datetime. This folder can hold future spill files.
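For illustration, the folder name could be built like this. Encoding the datetime as Unix seconds and joining the two parts with `_` are assumptions of this sketch, and both function names are hypothetical.

```rust
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Builds a `pid_datetime` folder name from a process ID and a Unix timestamp.
pub fn spill_dir_name(pid: u32, unix_secs: u64) -> String {
    format!("{}_{}", pid, unix_secs)
}

/// Returns the spill folder for the current process under `root`.
/// The folder is not created here.
pub fn current_spill_dir(root: &Path) -> PathBuf {
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        // A clock set before 1970 falls back to 0 instead of panicking.
        .unwrap_or(0);
    root.join(spill_dir_name(std::process::id(), secs))
}
```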

For the caching of the files we should provide a time-to-live, TTL. This TTL can for instance be 1 day for files downloaded from the internet.

During startup we create a task that looks for old pid_datetime folders whose process is no longer alive (e.g. it was interrupted) and for files that have exceeded their TTL, and cleans them up.

~/.polars/
    # Spills from the streaming engine. For future reference
    pid_datetime/
    pid_datetime/
    # files with a TTL
    cache/
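The startup cleanup over this layout might look roughly like the following sketch. The standard library has no portable "is this PID alive" check, so liveness is passed in as a caller-supplied `is_alive` closure. Using the file modification time as the start of the TTL window is also an assumption, as are the function names.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Extracts the process ID from a `pid_datetime` folder name.
pub fn parse_pid(dir_name: &str) -> Option<u32> {
    dir_name.split_once('_')?.0.parse().ok()
}

/// Removes spill folders of dead processes and cache files older than `ttl`.
/// Returns the number of removed entries.
pub fn cleanup(root: &Path, ttl: Duration, is_alive: impl Fn(u32) -> bool) -> io::Result<usize> {
    let mut removed = 0;
    let now = SystemTime::now();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        if !entry.file_type()?.is_dir() {
            continue;
        }
        let name = entry.file_name().to_string_lossy().into_owned();
        if name == "cache" {
            for file in fs::read_dir(entry.path())? {
                let file = file?;
                if !file.file_type()?.is_file() {
                    continue;
                }
                let modified = file.metadata()?.modified()?;
                // A modification time in the future is treated as age zero.
                let age = now.duration_since(modified).unwrap_or(Duration::ZERO);
                if age > ttl {
                    fs::remove_file(file.path())?;
                    removed += 1;
                }
            }
        } else if let Some(pid) = parse_pid(&name) {
            // Only folders whose owning process is gone are deleted.
            if !is_alive(pid) {
                fs::remove_dir_all(entry.path())?;
                removed += 1;
            }
        }
    }
    Ok(removed)
}
```

Note that PIDs can be reused by the operating system, so a real liveness check would probably also need to compare the datetime part of the folder name against the start time of the process.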

The spill manager can be a static struct that initially only deals with downloads, caching and cleanup. I think that we should set an in-progress bit while downloading so that we don't start duplicate downloads.
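A minimal sketch of a static manager that guards against duplicate downloads. The `SpillManager` name, its methods, and keying the in-progress set by URL string are all hypothetical. A real version would also need to let a second caller wait for the first download rather than just report that one is running.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Process-wide manager (hypothetical) that tracks running downloads.
pub struct SpillManager {
    in_progress: Mutex<HashSet<String>>,
}

impl SpillManager {
    /// Returns the single static instance, created on first use.
    pub fn global() -> &'static SpillManager {
        static INSTANCE: OnceLock<SpillManager> = OnceLock::new();
        INSTANCE.get_or_init(|| SpillManager {
            in_progress: Mutex::new(HashSet::new()),
        })
    }

    /// Marks `url` as downloading. Returns false if a download of the
    /// same URL is already running, so the caller should not start another.
    pub fn try_begin_download(&self, url: &str) -> bool {
        self.in_progress.lock().unwrap().insert(url.to_string())
    }

    /// Clears the in-progress mark once the download finished or failed.
    pub fn finish_download(&self, url: &str) {
        self.in_progress.lock().unwrap().remove(url);
    }
}
```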

ritchie46 avatar May 21 '24 11:05 ritchie46

Would it make sense to automatically delete caches belonging to the current process on exit, even in the case of a crash? (Rather than waiting for the next time polars is run?)

daviewales avatar Feb 12 '25 04:02 daviewales