litdata Assert when deserializing `no_header_numpy` or `no_header

Assert when deserializing `no_header_numpy` or `no_header_tensor`.

Open ouj opened this issue 10 months ago • 4 comments

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Create/serialize a dataset with integer tensor or numpy.
Read/deserialize the created dataset.

Code sample

from litdata import optimize
import numpy as np
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data  # The data is serialized into bytes and stored into data chunks by the optimize operator.


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)

Expected behavior

Read and print the batch data.

Environment

PyTorch Version (e.g., 1.0): 2.1.2
OS (e.g., Linux): MacOS and Linux
How you installed PyTorch (conda, pip, source): pip install
Build command you used (if compiling from source):
Python version: 3.11
CUDA/cuDNN version: N/A
GPU models and configuration: N/A
Any other relevant information:

Additional context

Assert stack

Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError

Apr 04 '24 21:04 ouj

litdata litdata copied to clipboard

Assert when deserializing `no_header_numpy` or `no_header_tensor`.

🐛 Bug

To Reproduce

Code sample

Expected behavior

Environment

Additional context

litdata
litdata copied to clipboard