streaming
A Data Streaming Library for Efficient Neural Network Training
https://github.com/mosaicml/streaming/blob/0b055ffcbce130ea7c5a99cd35fe2ec7702af4ac/streaming/base/spanner.py#L52 Hello, curious whether it might be more customary to use Python's `IndexError` instead of your custom `ValueError` when an index is out of bounds in `__getitem__`. One consequence of...
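To illustrate why the exception type in `__getitem__` can matter, here is a minimal sketch in plain Python. It does not use the library's code: `ValueErrorSeq`, `IndexErrorSeq`, and `collect` are hypothetical names invented for this example. It shows one well-known effect, which may or may not be the consequence the issue goes on to describe: an object that defines `__getitem__` but not `__iter__` is iterated by calling it with 0, 1, 2, ... until `IndexError` is raised, so any other exception type propagates instead of ending the loop.

```python
class ValueErrorSeq:
    """Hypothetical container that raises ValueError when out of bounds."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        if not 0 <= index < len(self._items):
            raise ValueError(f"index {index} out of range")
        return self._items[index]


class IndexErrorSeq(ValueErrorSeq):
    """Same hypothetical container, but raising the customary IndexError."""

    def __getitem__(self, index):
        if not 0 <= index < len(self._items):
            raise IndexError(f"index {index} out of range")
        return self._items[index]


def collect(seq):
    """Iterate via the legacy __getitem__ protocol and report what happens."""
    try:
        return list(seq)
    except ValueError as exc:
        return f"ValueError: {exc}"


print(collect(IndexErrorSeq([10, 20, 30])))
print(collect(ValueErrorSeq([10, 20, 30])))
```

With `IndexError`, iteration stops cleanly after the last element; with `ValueError`, the same loop surfaces an exception once it steps past the end.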
## Description of changes: ## Issue #, if available: ## Merge Checklist: _Put an `x` without space in the boxes that apply. If you are unsure about any checklist, please...
## Description of changes: Very rarely we see ready_thread assigned a higher priority when num_workers > 1. The observation is that ready_thread progresses way faster than prepare_thread. It is unknown...
## 🚀 Feature Request ## Motivation When converting a huge dataset, I would like to resume the conversion process if it fails partway through. ## [Optional] Implementation ## Additional...
## 🚀 Feature Request Number of shards that would be created, estimated with help of size_limit and data size can be a useful metric. ## Motivation If in future, other...
Do I understand correctly that the cache_limit parameter only works for MDS shards and does not index [extra_local](https://github.com/mosaicml/streaming/blob/5f939c9057b041f10342dfc5744d2d3880e3f14b/streaming/multimodal/webvid.py#L207) for downloading videos? https://github.com/mosaicml/streaming/blob/5f939c9057b041f10342dfc5744d2d3880e3f14b/streaming/multimodal/webvid.py#L210 If so, is it possible to clear the...
## Environment - mosaicml-streaming==0.7.5 ## To reproduce Steps to reproduce the behavior: 1. Use `StreamingDataset` in distributed training with the same seed and set `replication` either to None or an...
### Background: Our data is quite large and varies in size. With a size limit of 100 MB, there will only be 8 or 9 samples per shard. I have...
## Environment - OS: Windows 11 ## To reproduce Steps to reproduce the behavior: 1. Upload and convert a local webdataset using MDS writer like the following code produces huge...
## Environment - OS: Debian 12 on GCE - Hardware (GPU, or instance type): N4 ## To reproduce Steps to reproduce the behavior: 1. Run inside of a GCE machine...