PyHealth icon indicating copy to clipboard operation
PyHealth copied to clipboard

Add data caching

Open zzachw opened this issue 9 months ago • 4 comments

Add caching supports for

  • BaseDataset (caching dataframe)
  • SampleDataset (caching list of dictionaries)

zzachw avatar Apr 09 '25 20:04 zzachw

hey Zhenbang, by caching the dataframe do you mean caching the data downloaded from a data source?

kevinfjiang avatar Jun 05 '25 00:06 kevinfjiang

Hi @kevinfjiang, thanks for your interest!

This issue should go together with #332. The goal is to improve the memory usage by loading in chunk and off-loading some dataframe to disk.

zzachw avatar Jun 10 '25 15:06 zzachw

I'm currently working on issue #332. If you are interested, we'd appreciate your help on caching the SampleDataset (a list of dictionaries). This will allow the user to re-use a processed dataset. Happy to discuss further!

zzachw avatar Jun 10 '25 15:06 zzachw

Update on this:

I've merged an initial ability to cache generated task samples in .parquet. It hopefully should work pretty easily.

For dataframe caching, it's not clear if we need this because everything is already lazy-loaded. Most performance runtime is during the task construction here.

jhnwu3 avatar Sep 23 '25 18:09 jhnwu3