Add data caching
Add caching support for:
- BaseDataset (caching dataframe)
- SampleDataset (caching list of dictionaries)
Hey Zhenbang, by caching the dataframe, do you mean caching the data downloaded from a data source?
Hi @kevinfjiang, thanks for your interest!
This issue should go together with #332. The goal is to reduce memory usage by loading data in chunks and offloading some dataframes to disk.
I'm currently working on issue #332. If you are interested, we'd appreciate your help on caching the SampleDataset (a list of dictionaries). This will allow the user to re-use a processed dataset. Happy to discuss further!
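To make the SampleDataset idea concrete, here is a minimal sketch of caching a processed list of dictionaries so it can be re-used across runs. It is only an illustration under stated assumptions: `cache_key`, `load_or_build_samples`, the pickle file format, and the cache directory layout are all hypothetical names and choices, not the library's actual API.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


def cache_key(dataset_name: str, task_name: str, params: Dict[str, Any]) -> str:
    """Hash everything that affects the processed samples into a short key (hypothetical helper)."""
    payload = json.dumps(
        {"dataset": dataset_name, "task": task_name, "params": params},
        sort_keys=True,  # stable ordering so equal params give equal keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_build_samples(
    build_fn: Callable[[], List[Sample]],
    cache_dir: Path,
    key: str,
) -> List[Sample]:
    """Return cached samples if a cache file exists; otherwise build and cache them."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = cache_dir / f"{key}.pkl"
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    samples = build_fn()
    # Write to a temporary file first, then rename, so an interrupted
    # run does not leave a truncated cache file behind.
    tmp_path = cache_path.with_suffix(".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(samples, f, protocol=pickle.HIGHEST_PROTOCOL)
    tmp_path.replace(cache_path)
    return samples


if __name__ == "__main__":
    def build() -> List[Sample]:
        # Stand-in for the expensive task-construction step.
        return [{"patient_id": "p1", "label": 0}, {"patient_id": "p2", "label": 1}]

    with tempfile.TemporaryDirectory() as tmp:
        key = cache_key("demo_dataset", "demo_task", {"window_hours": 24})
        first = load_or_build_samples(build, Path(tmp), key)   # builds and caches
        second = load_or_build_samples(build, Path(tmp), key)  # served from disk
        print(first == second)
```

Keying the cache on the dataset name, task name, and task parameters means that changing any of them produces a fresh cache entry instead of silently re-using stale samples. Pickle is used here only to keep the sketch dependency-free; it should only be loaded from trusted cache directories.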
Update on this:
I've merged initial support for caching generated task samples as .parquet files. Hopefully it should be easy to use.
For dataframe caching, it's not clear whether we need it, because everything is already lazy-loaded. Most of the runtime is spent during task construction here.