
write parquet in MultiZarrToZarr

Open martindurant opened this issue 1 year ago • 2 comments

@agoodm

combine currently creates an in-memory dict of all references before writing anything out. With the new parquet formalism, we could replace this dict with a write version of LazyReferenceMapper. Maybe we can have a set of dicts that get filled; when each one is full, write it out and drop it, releasing the memory.
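Roughly what I have in mind, purely as an illustration (this is not the actual LazyReferenceMapper interface; the function names, batch size and file layout below are made up, and inline/raw references are ignored):

```python
import pandas as pd

def write_references_in_batches(refs, outdir, batch_size=100_000):
    """refs: iterable of (key, (path, offset, size)) pairs, e.g. whatever
    the combine loop produces; inline/raw references omitted for brevity."""
    batch, n = {}, 0
    for key, ref in refs:
        batch[key] = ref
        if len(batch) >= batch_size:      # this dict is full: write and drop it
            n = _flush(batch, outdir, n)
    if batch:
        _flush(batch, outdir, n)          # final partial batch

def _flush(batch, outdir, n):
    df = pd.DataFrame(
        [(k, *v) for k, v in batch.items()],
        columns=["key", "path", "offset", "size"],
    )
    df.to_parquet(f"{outdir}/refs.{n}.parq")  # needs pyarrow or fastparquet
    batch.clear()                             # release the memory
    return n + 1
```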

Thoughts?

martindurant · Mar 27 '23 14:03

Thanks for the ping, happy to be kept in the loop on this!

I need some time to think about this a bit more, but I think this is a good starting point. A follow-up question is how we then handle writing the references out to parquet. One possible option would be to generalize the logic from refs_to_dataframe into the writable reference mapper object itself. Currently it needs to be given all references before it can write out the parquet files. What I am thinking instead is that it can be initialized to extract variable names and metadata, and then the parquet files can be written out in streaming fashion: once the number of reference key-value pairs set in the object exceeds a certain threshold (maybe the record_size?), they are written out and deleted from memory.
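Something along these lines, just to sketch the shape of it (class and file names are made up, this is not the current kerchunk API, and inline/raw references are ignored):

```python
from collections import defaultdict
import json
import os

import pandas as pd

class StreamingRefWriter:
    """Capture metadata keys up front; stream chunk references out to
    parquet once a variable has accumulated `record_size` of them."""

    def __init__(self, outdir, record_size=100_000):
        self.outdir = outdir
        self.record_size = record_size
        self.metadata = {}                  # .zgroup/.zattrs/.zarray, kept in memory
        self.chunks = defaultdict(list)     # variable -> pending reference rows
        self.counts = defaultdict(int)      # variable -> records written so far

    def __setitem__(self, key, ref):
        if key.endswith((".zgroup", ".zattrs", ".zarray")):
            self.metadata[key] = ref        # variable names/metadata come from these
            return
        var = key.rsplit("/", 1)[0]         # e.g. "temp/0.0.1" -> "temp"
        path, offset, size = ref
        self.chunks[var].append((key, path, offset, size))
        if len(self.chunks[var]) >= self.record_size:
            self._flush(var)

    def _flush(self, var):
        rows = self.chunks.pop(var)         # drop from memory once written
        os.makedirs(f"{self.outdir}/{var}", exist_ok=True)
        n = self.counts[var]
        pd.DataFrame(rows, columns=["key", "path", "offset", "size"]).to_parquet(
            f"{self.outdir}/{var}/refs.{n}.parq"
        )
        self.counts[var] = n + 1

    def finalize(self):
        for var in list(self.chunks):       # write any partial records
            self._flush(var)
        with open(f"{self.outdir}/.zmetadata", "w") as f:
            json.dump(self.metadata, f)
```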

agoodm · Mar 28 '23 02:03

Yes, I agree that a lazy writer for combine and the existing code in refs_to_dataframe share at least a certain amount of logic. The biggest difference is that combine has whole datasets (or file systems) as the outermost loop, rather than fields/directories.

As you say, both access patterns could probably be served by a generalised lazy writer. You would need a set of objects, each with independent (paths, offsets, sizes, raws) buffers and a count of how many paths have been set so far. Each instance can remember the path it's supposed to write to and do a write-and-clear once all of its references have been set.
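A rough sketch of one such object, just to show the shape I mean (names are illustrative, not real kerchunk code):

```python
import pandas as pd

class RecordBuffer:
    """Parallel lists for (paths, offsets, sizes, raws), a count of how many
    entries have been set, and a remembered output path, so the record can
    write-and-clear itself as soon as it is complete."""

    def __init__(self, out_path, nrefs):
        self.out_path = out_path            # parquet file this record belongs to
        self.nrefs = nrefs                  # how many references it should hold
        self.paths = [None] * nrefs
        self.offsets = [0] * nrefs
        self.sizes = [0] * nrefs
        self.raws = [None] * nrefs          # inline/raw references
        self.filled = 0

    def set(self, i, path=None, offset=0, size=0, raw=None):
        self.paths[i], self.offsets[i], self.sizes[i], self.raws[i] = (
            path, offset, size, raw
        )
        self.filled += 1
        if self.filled == self.nrefs:       # all references set: flush and free
            self.write_and_clear()

    def write_and_clear(self):
        pd.DataFrame(
            {"path": self.paths, "offset": self.offsets,
             "size": self.sizes, "raw": self.raws}
        ).to_parquet(self.out_path)
        self.paths = self.offsets = self.sizes = self.raws = None
```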

martindurant · Mar 28 '23 15:03