streaming issues

Passing jpegs constructed from byte streams crashes with `FileNotFoundError: [Errno 2] No such file or directory: ''`

2

## Environment I don't believe this bug is environment dependent ## To reproduce This line of code: https://github.com/mosaicml/streaming/blob/v0.5.2/streaming/base/format/mds/encodings.py#L417 assumes that `Image`s created from byte-streams will have `hasattr(obj, 'filename') == False`,...

davidabrahams1

bug
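
The failure mode reported above can be illustrated without the library. This is a minimal sketch, assuming that an image decoded from an in-memory stream still carries a `filename` attribute set to an empty string, as the empty path in the error message suggests. `FakeImage` and both helper functions are hypothetical names for illustration only and are not part of `streaming` or any imaging library.

```python
class FakeImage:
    """Hypothetical stand-in for an image object; not a real library class."""

    def __init__(self, filename=''):
        # Assumption: an image built from a byte stream keeps an empty filename.
        self.filename = filename


def has_backing_file_naive(obj):
    # Checks only that the attribute exists, so an empty filename still passes.
    return hasattr(obj, 'filename')


def has_backing_file(obj):
    # Additionally requires the attribute to be non-empty.
    return bool(getattr(obj, 'filename', ''))


from_stream = FakeImage()           # mimics an image decoded from bytes
from_disk = FakeImage('photo.jpg')  # mimics an image opened from a path

print(has_backing_file_naive(from_stream), has_backing_file(from_stream))
print(has_backing_file_naive(from_disk), has_backing_file(from_disk))
```

Under that assumption, the attribute-presence check treats the in-memory image as file-backed, while the emptiness check does not.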

Experiment: much more powerful MDS type system ("DBS").

The collection of MDS types in Streaming is an evolved ad hoc system of pragmatic one-offs arrived at reactively to solve problems. *In this organically emerged state of primordial chaos,...

knighton

Support for lists for supported encodings

11

## 🚀 Feature Request The current supported encodings are listed here: https://github.com/mosaicml/streaming/blob/59f6ec5f8f97cc5f9a75954fef4bef3221460ff8/streaming/base/format/mds/encodings.py#L270 I would like to have support for lists, i.e. columns that are lists of integers, jpegs, etc. Theses...

VictorSanh

enhancement

Parallelize StreamingDataset index downloads.

## Description of changes: Download each stream's index in a different thread at the same time. ## Issue #, if available: ## Merge Checklist: _Put an `x` without space in...

knighton
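
The idea of fetching each stream's index in its own thread might look roughly like the sketch below. It uses only the standard library; `fetch_index` is a hypothetical placeholder that returns a path instead of performing a real download, and neither function name nor the `max_workers` default comes from the pull request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_index(stream_name):
    """Hypothetical placeholder for downloading one stream's index file."""
    return f'{stream_name}/index.json'


def fetch_all_indexes(stream_names, max_workers=8):
    """Run fetch_index for every stream concurrently, preserving input order."""
    if not stream_names:
        return []
    workers = min(max_workers, len(stream_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(fetch_index, stream_names))


print(fetch_all_indexes(['stream_a', 'stream_b']))
```

Threads suit this kind of work because the time is spent waiting on I/O rather than on computation.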

Shared lock

## Description of changes:

Implement a shared lock.

- 25x faster than FileLock
- No filesystem cleanup afterward
- Lives in a few bytes of shared memory
- Uses pthread_mutexattr_setpshared...

knighton

MDS: standard ?

6

Hi, Thanks for the lib! I think the 2 main alternatives in the pytorch world are webdataset and torch data. They both support tar files as shard format. The benefit...

rom1504

enhancement

Streaming via ssh across clusters

2

## 🚀 Feature Request I want to use `streaming` to access a remote datacenter via ssh with certain privacy-related permission. ## Motivation I want to use a separate cluster to...

gaow0007

enhancement

Redesign partitioning algorithm

2

## Description of changes: When it comes to partitioning, we have the need for speed. Using the original partitioning implementation as guide, we replicate correct partitioning behavior in pure numpy...

knighton

CUDA initialization error at raw_delete when crossing epoch boundary

2

**Environment**

- OS: [Ubuntu 20.04]
- Hardware (GPU, or instance type): [A100] >= 2 GPUs

**To reproduce**

Steps to reproduce the behavior: When trying to run [examples/bert mlm training](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training) (using...

karan6181

bug

Last entry in the dataset is causing "Relative sample index $x is not present" error

3

## Environment

- OS: [Ubuntu 20.04]
- Hardware (GPU, or instance type): [H100]

When I try to load a big dataset with ~thousands of shards (each shard is ~1GB), on...

isidentical

bug
