litdata
AssertionError when deserializing `no_header_numpy` or `no_header_tensor` data.
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Create/serialize a dataset with integer tensor or numpy.
- Read/deserialize the created dataset.
Code sample
import numpy as np

from litdata import optimize
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)
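For context (a minimal sketch with plain NumPy, not litdata code): a "no header" format stores only the raw array bytes, so the dtype must be supplied out of band at read time — which is exactly the state the failing assert appears to check for.

```python
import numpy as np

# An integer array serialized "headerless": only the raw bytes are kept.
arr = np.arange(1, 100)
raw = arr.tobytes()  # no dtype or shape information is embedded in the payload

# Reading the bytes back requires the dtype to be known out of band;
# without it, np.frombuffer cannot reconstruct the original array.
restored = np.frombuffer(raw, dtype=arr.dtype)
assert (restored == arr).all()
```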
Expected behavior
The batch data should be read and printed without raising an AssertionError.
Environment
- PyTorch Version (e.g., 1.0): 2.1.2
- OS (e.g., Linux): macOS and Linux
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.11
- CUDA/cuDNN version: N/A
- GPU models and configuration: N/A
- Any other relevant information:
Additional context
AssertionError stack trace:
Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError
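The failure mode can be sketched in isolation (this is an illustrative stand-in, not litdata's actual serializer — the class and `setup` method here are hypothetical): a no-header serializer keeps the dtype out of band on the instance, and `deserialize` asserts that it was configured before any payload is decoded.

```python
import numpy as np


class NoHeaderSketchSerializer:
    """Illustrative stand-in for a no-header serializer: the dtype is kept
    out of band in self._dtype instead of being written into the payload."""

    def __init__(self):
        self._dtype = None  # normally populated from the dataset configuration

    def setup(self, dtype):
        self._dtype = np.dtype(dtype)

    def serialize(self, arr):
        return arr.tobytes()  # raw bytes, no dtype/shape header

    def deserialize(self, data):
        assert self._dtype  # fails like serializers.py line 261 if never configured
        return np.frombuffer(data, dtype=self._dtype)


s = NoHeaderSketchSerializer()
arr = np.arange(1, 100)
payload = s.serialize(arr)

try:
    s.deserialize(payload)  # self._dtype is still None -> AssertionError, as in the traceback
except AssertionError:
    print("deserialize failed: dtype was never configured")

s.setup(arr.dtype)  # once the dtype is known, the round trip succeeds
assert (s.deserialize(payload) == arr).all()
```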