[Question] StreamingOutsideDTWebVid cache_limit for video
Do I understand correctly that the cache_limit parameter only applies to MDS shards and does not cover the extra_local directory where the videos are downloaded?
https://github.com/mosaicml/streaming/blob/5f939c9057b041f10342dfc5744d2d3880e3f14b/streaming/multimodal/webvid.py#L210
If so, is it possible to clear old videos from that folder using an additional cache_limit?
Best regards, Alex.
Yes, your understanding is correct. You could track the cache usage similar to how we already do it in StreamingDataset (here), but you may face issues with eviction since the videos are not stored as streaming shards.
The cache limit works for any data shards written with streaming, including MDS / CSV / JSONL.
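For the downloaded videos, tracking usage and evicting by hand could look roughly like the sketch below. It is plain standard-library Python, not part of the streaming API: the function name `evict_oldest` and the choice to evict by modification time are assumptions made for illustration, and it does not coordinate with other workers that may still be reading a file.

```python
from pathlib import Path


def evict_oldest(cache_dir: str, cache_limit: int) -> list[str]:
    """Delete the oldest files in cache_dir until its total size is <= cache_limit bytes.

    Hypothetical helper: "oldest" here means smallest modification time.
    Returns the paths that were removed, in eviction order.
    """
    # Collect (mtime, size, path) for every regular file under the directory.
    entries = []
    for path in Path(cache_dir).rglob('*'):
        if path.is_file():
            stat = path.stat()
            entries.append((stat.st_mtime, stat.st_size, path))

    # Oldest files first.
    entries.sort(key=lambda entry: entry[0])

    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in entries:
        if total <= cache_limit:
            break
        path.unlink()
        total -= size
        removed.append(str(path))
    return removed
```

Calling something like `evict_oldest(extra_local, limit_in_bytes)` after each batch of downloads would keep the folder bounded, with the caveat from the reply above that a video still in use could be removed.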
Thanks for the answer! I have another question: is it possible to work with ZIP or TAR archives, as the webdataset library does?
https://github.com/webdataset/webdataset
Hey, we don't offer direct support for ZIP or TAR archives. Streaming requires the data to be in a predictable format (as written by our Writer classes, such as MDSWriter, JSONWriter, CSVWriter, etc.) so that the dataset knows how to access particular samples with very low latency.
However, the Writer classes do offer compression of shards if you want to save space. See this section in the docs for more info on writing datasets & compression.
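If the data already sits in webdataset-style TAR archives, one option is to convert it once into a Writer format. The sketch below covers only the reading half, using the standard library: it groups consecutive TAR members that share a basename into one sample dict. The function name `iter_tar_samples` and the assumption that files of one sample are adjacent and keyed by the name before the first dot are illustrative, not something Streaming provides.

```python
import os
import tarfile


def iter_tar_samples(tar_path: str):
    """Yield one dict per sample from a TAR archive.

    Assumes members belonging to the same sample are adjacent and share the
    part of the file name before the first dot, e.g. 0001.mp4 and 0001.txt.
    """
    current_key, sample = None, {}
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "dir/0001.mp4" into key "dir/0001" and extension "mp4".
            head, tail = os.path.split(member.name)
            base, _, ext = tail.partition('.')
            key = os.path.join(head, base)
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {'key': key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample
```

Each yielded dict could then be written out with one of the Writer classes mentioned above, for example MDSWriter with a matching `columns` mapping and a `compression` setting. Check the docs for the exact column encodings and supported compression names before relying on this.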