
Assert when deserializing `no_header_numpy` or `no_header_tensor`.

Open ouj opened this issue 1 year ago • 4 comments

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Create/serialize a dataset with integer tensor or numpy.
  2. Read/deserialize the created dataset.

Code sample

from litdata import optimize
import numpy as np
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)

Expected behavior

Read and print the batch data.

Environment

  • PyTorch Version (e.g., 1.0): 2.1.2
  • OS (e.g., Linux): macOS and Linux
  • How you installed PyTorch (conda, pip, source): pip install
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • Any other relevant information:

Additional context

Assert stack

Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError

ouj avatar Apr 04 '24 21:04 ouj

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Apr 04 '24 21:04 github-actions[bot]

Looks like the setup() method on NoHeaderTensorSerializer and NoHeaderNumpySerializer wasn't called before deserialize was called.
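
For context, a header-less format stores only the raw array bytes and relies on the dtype being configured up front, roughly like this (a conceptual sketch of the failure mode, not litdata's actual implementation):

import numpy as np

# Conceptual sketch of a header-less numpy round trip: the dtype is not stored
# per item, so the reader must be configured (setup) with it before deserializing.
# This is NOT litdata's code, just an illustration of why the assert fires when
# the dtype was never set.
class NoHeaderLikeSerializer:
    def __init__(self):
        self._dtype = None

    def setup(self, dtype):
        self._dtype = np.dtype(dtype)

    def serialize(self, arr):
        # Raw bytes only: no shape or dtype header is written.
        return arr.tobytes()

    def deserialize(self, data: bytes):
        assert self._dtype  # fails if setup() was never called
        return np.frombuffer(data, dtype=self._dtype)


s = NoHeaderLikeSerializer()
payload = s.serialize(np.arange(1, 100))
s.setup("int64")  # without this call, deserialize() would hit the assert
print(s.deserialize(payload)[:5])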

ouj avatar Apr 04 '24 21:04 ouj

Okay... found a workaround. The problem is that the numpy array is a 1D array.

The fix is to reshape it to a 2D array so that a "header" gets created? 🤯

np.arange(10).reshape(1, -1)
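
Roughly, applied to the repro above, the workaround looks like this (a sketch, only checked against the shapes used in this issue; the squeeze after loading just undoes the extra dimension):

from litdata import optimize
import numpy as np
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    return {
        "index": index,  # int data type
        # Reshape the 1D array to (1, N) so it is not stored in the
        # header-less "no_header_numpy" format.
        "class": np.arange(1, 100).reshape(1, -1),
    }


if __name__ == "__main__":
    optimize(
        fn=random_images,
        inputs=list(range(10)),
        output_dir="my_optimized_dataset",
        num_workers=0,
        chunk_bytes="64MB",
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(dataset, num_workers=0, batch_size=1, shuffle=False, drop_last=False)

    for data in dataloader:
        # Drop the extra dimension that was only added to force a header.
        print(data["index"], data["class"].squeeze(1))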

ouj avatar Apr 04 '24 21:04 ouj

Hey @ouj. Yes, 1D data is handled differently in order to support tokens for training LLMs. This isn't nice behaviour; I meant to provide a better mechanism but never got to it.

tchaton avatar Apr 05 '24 15:04 tchaton

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 06:04 stale[bot]

This issue appears to have been resolved by #401 or an earlier change.

Tested with the provided sample and the issue no longer seems to reproduce.

Please feel free to reopen the issue if that's not the case.

bhimrazy avatar Apr 16 '25 07:04 bhimrazy