[Question] StreamingOutsideDTWebVid cache_limit for video

Open nagadit opened this issue 1 year ago • 2 comments

Do I understand correctly that the cache_limit parameter applies only to MDS shards and does not track the videos downloaded to extra_local?

https://github.com/mosaicml/streaming/blob/5f939c9057b041f10342dfc5744d2d3880e3f14b/streaming/multimodal/webvid.py#L210

If so, is it possible to clear old videos from that folder using an additional cache_limit?

Best regards, Alex.

nagadit avatar Jul 22 '24 19:07 nagadit

Yes, your understanding is correct. You could track the cache usage similarly to how we already do it in StreamingDataset (here), but you may face issues with eviction since the videos are not stored as streaming shards.

The cache limit works for any data shards written with streaming, including MDS / CSV / JSONL.

snarayan21 avatar Jul 23 '24 09:07 snarayan21
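For illustration, tracking and trimming a plain video folder could look roughly like the sketch below. It is not part of the streaming library: `evict_oldest` is a hypothetical helper, and it assumes that file modification time is an acceptable proxy for "old" and that no worker is still reading a file when it gets deleted (which is the eviction issue mentioned above).

```python
from pathlib import Path


def evict_oldest(directory: str, limit_bytes: int) -> list[str]:
    """Delete the least recently modified files until the folder fits the limit.

    Hypothetical helper, not a streaming API. Returns the removed paths.
    """
    files = [p for p in Path(directory).rglob('*') if p.is_file()]
    # Collect (mtime, size, path) and sort so the oldest files come first.
    entries = sorted(
        ((p.stat().st_mtime, p.stat().st_size, p) for p in files),
        key=lambda entry: entry[0],
    )
    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in entries:
        if total <= limit_bytes:
            break
        path.unlink()  # Unsafe if another process is still reading this file.
        total -= size
        removed.append(str(path))
    return removed
```

You would call something like `evict_oldest(extra_local, 50 * 1024**3)` periodically from a single process. With several workers sharing the folder, some locking or reference tracking would be needed on top of this.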

Thanks for the answer! I have another question: is it possible to work with ZIP or TAR archives, as the webdataset library does?

https://github.com/webdataset/webdataset

nagadit avatar Aug 01 '24 09:08 nagadit

Hey, we don't offer direct support for zip or tar, since Streaming requires the data to be in a predictable format (as written by our Writer classes, such as MDSWriter, JSONWriter, CSVWriter, etc.) so that the dataset knows how to access particular samples with very low latency.

However, the Writer classes do offer compression of shards if you want to save space. See this section in the docs for more info on writing datasets & compression.

snarayan21 avatar Sep 16 '24 14:09 snarayan21
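If the data already lives in a tar archive, one option is to convert it once into compressed MDS shards. The following is a minimal sketch under some assumptions: `iter_tar_samples` and `tar_to_mds` are hypothetical helper names, the two-column schema (`name`, `data`) is only an example, and `tar_to_mds` requires the mosaicml-streaming package to be installed.

```python
import tarfile
from typing import Iterator


def iter_tar_samples(tar_path: str) -> Iterator[dict]:
    """Yield one sample dict per regular file in the tar archive."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # Skip directories, links, and other special entries.
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            yield {'name': member.name, 'data': fileobj.read()}


def tar_to_mds(tar_path: str, out_dir: str) -> None:
    """Write the tar contents as zstd-compressed MDS shards."""
    # Imported lazily so the tar reader above works without streaming installed.
    from streaming import MDSWriter

    # Example schema (an assumption): file name plus raw bytes.
    columns = {'name': 'str', 'data': 'bytes'}
    with MDSWriter(out=out_dir, columns=columns, compression='zstd') as writer:
        for sample in iter_tar_samples(tar_path):
            writer.write(sample)
```

Once written, the shards can be read with StreamingDataset, and cache_limit applies to them like to any other streaming shards.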