
JointWriter: Allow shard file appending

Open janEbert opened this issue 5 months ago • 2 comments

I am working on a file system that loves few huge files and hates many small files. To this end, I would simply set size_limit=None when creating a dataset using a JointWriter. However, shards are only flushed (data written to disk and freed from RAM) once the size_limit is reached. This means I cannot create shards greater than my RAM (because the data in RAM keeps growing and is never flushed). This becomes especially apparent when I write using multiple processes on the same node.

I'd love it if, even with an unlimited size_limit, shard files could be written incrementally so that I can create shards larger than RAM. I would personally be fine with only MDSWriter and a limited set of compressions supporting this. It seems like MDSWriter's encode_joint_to_shard implementation could support this.
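
A minimal sketch of the requested behavior, to make the idea concrete. Everything here is hypothetical and not part of the streaming library: the class name AppendingShardWriter, the flush_limit parameter, and the raw concatenated file layout are assumptions chosen for illustration. The point is only that the RAM buffer is bounded by flush_limit and emptied by appending to the shard file, independent of how large the shard itself grows. Shard formats that store a sample index in a header would need extra handling, which this sketch sidesteps by returning the sample sizes separately.

```python
from typing import List


class AppendingShardWriter:
    """Hypothetical writer that appends to one shard file in bounded-RAM chunks.

    Not the streaming library's API; an illustration of the requested behavior.
    """

    def __init__(self, path: str, flush_limit: int = 1 << 20) -> None:
        self.path = path
        self.flush_limit = flush_limit      # max bytes held in RAM before flushing
        self.buffer: List[bytes] = []       # samples not yet written to disk
        self.buffered_bytes = 0
        self.sample_sizes: List[int] = []   # small per-sample metadata kept in RAM
        open(path, 'wb').close()            # create (or truncate) the shard file

    def write(self, sample: bytes) -> None:
        """Buffer one serialized sample and flush once the RAM budget is hit."""
        self.buffer.append(sample)
        self.buffered_bytes += len(sample)
        self.sample_sizes.append(len(sample))
        if self.buffered_bytes >= self.flush_limit:
            self.flush()

    def flush(self) -> None:
        """Append buffered samples to the shard file and free the buffer."""
        if not self.buffer:
            return
        with open(self.path, 'ab') as out:
            out.write(b''.join(self.buffer))
        self.buffer.clear()
        self.buffered_bytes = 0

    def finish(self) -> List[int]:
        """Write any remaining samples and return the per-sample sizes."""
        self.flush()
        return self.sample_sizes
```

With this shape, the shard on disk can grow past the size of RAM because only up to flush_limit bytes of sample data (plus one integer per sample) are held in memory at any time. Compression that operates on the whole shard at once would not fit this pattern without a streaming compressor.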

Is this a feature you would accept contributions for or would it create too much maintenance workload with regard to various settings (compression etc.)?

janEbert, Sep 05 '24 08:09