JointWriter: Allow shard file appending
I am working on a file system that loves few huge files and hates many small files. To this end, I would simply set `size_limit=None` when creating a dataset using a `JointWriter`. However, shards are only flushed (data written to disk and freed from RAM) once the `size_limit` is reached. This means I cannot create shards larger than my RAM, because the data in RAM keeps growing and is never flushed. This becomes especially apparent when I write using multiple processes on the same node.
I'd love it if, even with an unlimited shard size (`size_limit=None`), shard files could be written partially so that I can create shards larger than RAM. I would personally be fine with only `MDSWriter` and a limited set of compressions supporting this. It seems like the `encode_joint_to_shard` implementation of `MDSWriter` could support this.
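
To illustrate the behaviour I have in mind, here is a minimal sketch in plain Python. It does not use the library's API or the MDS file layout: `AppendingShardWriter`, `flush_limit`, and the `.idx` sidecar file are hypothetical names made up for this example. The idea is only that the in-RAM buffer is appended to the single shard file whenever it grows past a threshold, while the per-sample sizes are kept aside and written once at the end.

```python
import io
import struct


class AppendingShardWriter:
    """Hypothetical writer that appends buffered samples to one shard file."""

    def __init__(self, path, flush_limit=1 << 20):
        self.path = path
        self.flush_limit = flush_limit  # max bytes held in RAM before a partial write
        self.buffer = io.BytesIO()
        self.sizes = []                 # per-sample byte sizes, needed for the index
        open(path, "wb").close()        # start from an empty shard file

    def write(self, sample):
        """Buffer one encoded sample and flush if the buffer is large enough."""
        self.buffer.write(sample)
        self.sizes.append(len(sample))
        if self.buffer.tell() >= self.flush_limit:
            self.flush()

    def flush(self):
        """Append the buffered bytes to the shard file and free the buffer."""
        with open(self.path, "ab") as f:
            f.write(self.buffer.getvalue())
        self.buffer = io.BytesIO()

    def finish(self):
        """Flush the remaining bytes and write the sample sizes to a sidecar file."""
        self.flush()
        with open(self.path + ".idx", "wb") as f:
            # Layout assumed for this sketch only: uint32 count, then uint64 sizes.
            f.write(struct.pack(f"<I{len(self.sizes)}Q", len(self.sizes), *self.sizes))
```

With this approach the peak memory use is bounded by `flush_limit` plus the list of sample sizes, independent of the final shard size. A real implementation would still have to deal with the shard header and with compression, which is why I would be fine with restricting it to a subset of settings.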
Is this a feature you would accept contributions for, or would it create too much maintenance workload given the various settings it interacts with (compression etc.)?