litdata
Assert when deserializing `no_header_numpy` or `no_header_tensor`.
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Create/serialize a dataset containing an integer tensor or numpy array.
- Read/deserialize the created dataset.
Code sample
from litdata import optimize
import numpy as np
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)
Expected behavior
The batch data is read and printed without error.
Environment
- PyTorch Version (e.g., 1.0): 2.1.2
- OS (e.g., Linux): macOS and Linux
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.11
- CUDA/cuDNN version: N/A
- GPU models and configuration: N/A
- Any other relevant information:
Additional context
Assert stack
Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError
Hi! Thanks for your contribution, and great first issue!
Looks like the `setup()` method on `NoHeaderTensorSerializer` and `NoHeaderNumpySerializer` wasn't called before `deserialize()` was invoked.
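For illustration, here is a minimal sketch of that failure mode (a simplified, hypothetical serializer; the class name and format string are assumptions, not litdata's actual internals). `_dtype` is only populated by `setup()`, so reaching `deserialize()` first trips the assert seen in the stack trace above:

import numpy as np

class NoHeaderSerializerSketch:
    def __init__(self):
        self._dtype = None  # only ever set by setup()

    def setup(self, data_format: str):
        # Hypothetical format string, e.g. "no_header_numpy:int64":
        # recover the dtype from metadata stored alongside the chunk.
        self._dtype = np.dtype(data_format.split(":")[-1])

    def deserialize(self, data: bytes) -> np.ndarray:
        # Without a header, the payload is raw bytes; the dtype must have
        # arrived out of band via setup() beforehand.
        assert self._dtype  # AssertionError when setup() was never called
        return np.frombuffer(data, dtype=self._dtype)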
Okay... found a workaround. The problem is that the numpy array is a 1D array.
The fix is to reshape it into a 2D array so that a "header" gets created? 🤯
np.arange(10).reshape(1, -1)
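Applied to the repro's random_images, the workaround looks like this (a sketch, not an official fix; note the loaded value comes back with shape (1, 99), so you may want to reshape it again after reading):

import numpy as np

def random_images(index):
    return {
        "index": index,
        # 2D instead of 1D: per the workaround above, the array now takes
        # the header-carrying path, so deserialize() no longer depends on
        # setup() having run.
        "class": np.arange(1, 100).reshape(1, -1),
    }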
Hey @ouj. Yes, 1D data is handled differently in order to support tokens for training LLMs. This isn't nice behaviour; I meant to provide a better mechanism but never got to it.
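To make that distinction concrete, here is a hedged sketch of the dispatch described above (the function name, format strings, and header layout are illustrative assumptions, not litdata's internals): 1D arrays take a headerless path meant for token sequences, while other shapes carry a header recording dtype and shape.

import numpy as np

def serialize_sketch(arr: np.ndarray) -> tuple[bytes, str]:
    if arr.ndim == 1:
        # Headerless path for token buffers: raw bytes only. The reader
        # must learn the dtype elsewhere (via setup()), which is the
        # contract the AssertionError enforces.
        return arr.tobytes(), f"no_header_numpy:{arr.dtype}"
    # Header path: dtype and shape travel with the payload, so the reader
    # can deserialize without any out-of-band state.
    header = f"{arr.dtype}|{','.join(map(str, arr.shape))}".encode()
    return len(header).to_bytes(4, "little") + header + arr.tobytes(), "numpy"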
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue appears to have been resolved by #401 or earlier.
Tested with the provided sample; the issue no longer reproduces.
Please feel free to reopen the issue if that's not the case.