
A Data Streaming Library for Efficient Neural Network Training

Results: 61 streaming issues

## Environment
- OS: Ubuntu 22.04
- Python: 3.10.13
- mosaicml-streaming: 0.7.4

## To reproduce
```python
from streaming import StreamingDataset

data = StreamingDataset(
    remote="gs://",
    local="/tmp/data",
    split="validation",
    batch_size=1024,
    allow_unsafe_types=True,
)
for...
```

bug

You can trigger a rebase of this PR by commenting `@dependabot rebase`. Dependabot commands and options: You can trigger Dependabot actions by commenting...

dependencies

…or arg

## Description of changes:
Modify StreamingDataset to support passing process_group as a constructor argument. Currently, StreamingDataset assumes it should use the default process group; however, for certain use...

## Description of changes:
Make merge_index utility run in parallel with multiprocessing. Note the normal use case for merge index happens after mds shards are written to a number of...
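
To illustrate the general idea of merging per-partition indexes concurrently, here is a minimal sketch. It is not the PR's implementation: it assumes each partition directory holds an `index.json` with a top-level `shards` list, it uses a thread pool from `multiprocessing.pool` rather than worker processes, and the `load_index` and `merge_indexes` helpers and the `subdir` field are hypothetical names introduced only for this example.

```python
import json
import os
from multiprocessing.pool import ThreadPool


def load_index(path):
    """Read one index.json and return (partition directory name, shard entries)."""
    with open(path) as f:
        index = json.load(f)
    return os.path.basename(os.path.dirname(path)), index["shards"]


def merge_indexes(index_paths, num_workers=4):
    """Load partition indexes concurrently and concatenate their shard lists."""
    with ThreadPool(num_workers) as pool:
        # pool.map returns results in the same order as index_paths.
        loaded = pool.map(load_index, index_paths)

    merged = []
    for subdir, shards in loaded:
        for shard in shards:
            entry = dict(shard)
            # Hypothetical field: remember which partition the shard came from.
            entry["subdir"] = subdir
            merged.append(entry)
    return {"version": 2, "shards": merged}
```

Because `pool.map` preserves input order, the merged shard list is deterministic for a given list of paths, which matters if shard order affects sample indexing.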

## 🚀 Feature Request
Large `index.json` files are slow to load. Currently, I am trying to increase shard size, so stream.py#L473 will be faster (hopefully).

## Motivation
These two steps are...

enhancement

## To reproduce
Call `clean_stale_shared_memory()` at the beginning of a `train.py` script that is itself launched with composer in a distributed setup.

## Expected behavior
The memory is cleaned at the beginning...

bug

## Environment
- OS: Ubuntu 22.04

## To reproduce
Steps to reproduce the behavior: When using the `StreamingDataLoader` (or the vanilla PyTorch `DataLoader`) with `num_workers>0`, the processes slowly take more...

bug

## 🚀 Feature Request
When I use multiple Streams to create a StreamingDataset, I want to be able to use a different pre-processing function to process the data in each...

enhancement

## 🚀 Feature Request
Hey folks - I've loved using `streaming` for some of my research in multimodal pretraining and robotics. One thing I'd love to support is first-class integration...

enhancement

Hiya! I have approximately 1k data streams, each containing pickled numpy arrays. When data is loaded, I need to sample a subsequence from it, so my dataloader looks like this:...