streaming

A Data Streaming Library for Efficient Neural Network Training

Results 88 streaming issues

## Environment I don't believe this bug is environment dependent ## To reproduce This line of code: https://github.com/mosaicml/streaming/blob/v0.5.2/streaming/base/format/mds/encodings.py#L417 assumes that `Image`s created from byte-streams will have `hasattr(obj, 'filename') == False`,...

bug

The collection of MDS types in Streaming is an evolved ad hoc system of pragmatic one-offs arrived at reactively to solve problems. *In this organically emerged state of primordial chaos,...

## 🚀 Feature Request The currently supported encodings are listed here: https://github.com/mosaicml/streaming/blob/59f6ec5f8f97cc5f9a75954fef4bef3221460ff8/streaming/base/format/mds/encodings.py#L270 I would like to have support for lists, i.e. columns that are lists of integers, jpegs, etc. These...

enhancement

## Description of changes: Download each stream's index in a different thread at the same time. ## Issue #, if available: ## Merge Checklist: _Put an `x` without space in...
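
For readers unfamiliar with the pattern this pull request describes, the following is a minimal sketch of fetching several indexes concurrently with a thread pool. It is not the code from the pull request: `fetch_indexes`, `fake_fetch`, the stream names, and the shape of the returned index are all hypothetical, chosen only to illustrate the idea.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_indexes(
    streams: List[str],
    fetch_one: Callable[[str], dict],
    max_workers: int = 8,
) -> Dict[str, dict]:
    """Fetch every stream's index concurrently, one task per stream.

    `fetch_one` is a caller-supplied function (hypothetical here) that
    downloads and parses the index for a single stream.
    """
    if not streams:
        return {}
    workers = min(max_workers, len(streams))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order and re-raises any worker
        # exception when its result is consumed.
        results = list(pool.map(fetch_one, streams))
    return dict(zip(streams, results))


def fake_fetch(stream: str) -> dict:
    # Stand-in for a real download; assumed index shape, for illustration only.
    return {"stream": stream, "shards": []}


indexes = fetch_indexes(["s3://bucket/a", "s3://bucket/b"], fake_fetch)
```

Threads suit this case because index downloads are I/O-bound, so the time spent waiting on the network overlaps across streams.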

## Description of changes: Implement a shared lock. - 25x faster than FileLock - No filesystem cleanup afterward - Lives in a few bytes of shared memory - Uses pthread_mutexattr_setpshared...

Hi, Thanks for the lib! I think the 2 main alternatives in the PyTorch world are webdataset and torchdata. They both support tar files as shard format. The benefit...

enhancement

## 🚀 Feature Request I want to use `streaming` to access a remote datacenter via ssh with certain privacy-related permission. ## Motivation I want to use a separate cluster to...

enhancement

## Description of changes: When it comes to partitioning, we have the need for speed. Using the original partitioning implementation as guide, we replicate correct partitioning behavior in pure numpy...

**Environment** - OS: [Ubuntu 20.04] - Hardware (GPU, or instance type): [A100] >= 2 GPUs **To reproduce** Steps to reproduce the behavior: When trying to run [examples/bert mlm training](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training) (using...

bug

## Environment - OS: [Ubuntu 20.04] - Hardware (GPU, or instance type): [H100] When I try to load a big dataset with ~thousands of shards (each shard is ~1GB), on...

bug