litdata
Compression using the optimize function from litdata
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Use the studio at https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming.
Modify it to use litdata instead of lightning.data.
Pass any compression method, for example "zstd", to the optimize function in convert.py.
Code sample
optimize(convert_parquet_to_lightning_data, parquet_files[:10], output_dir, num_workers=os.cpu_count(), chunk_bytes="64MB", compression="gzip")
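The traceback below is raised because the writer finds no compression backend installed. As a hedged sketch (not litdata's actual implementation), the kind of optional-backend registry that produces this error might look like the following; the registry contents and function names are assumptions for illustration:

```python
import zlib

# Sketch of an optional-backend registry a chunk writer might consult.
# zlib ships with CPython; zstd is a third-party package, so the zstd
# entry is only present when that package is installed.
_COMPRESSORS = {}

try:
    import zstd  # optional third-party dependency (assumption for this sketch)
    _COMPRESSORS["zstd"] = (zstd.compress, zstd.decompress)
except ImportError:
    pass

_COMPRESSORS["gzip"] = (zlib.compress, zlib.decompress)


def get_compressor(name: str):
    """Return a (compress, decompress) pair, or fail like the traceback below."""
    if name not in _COMPRESSORS:
        raise ValueError("No compresion algorithms are installed.")
    return _COMPRESSORS[name]


# Round-trip through the one backend guaranteed to exist here.
compress, decompress = get_compressor("gzip")
payload = b"parquet row group bytes" * 64
assert decompress(compress(payload)) == payload
```

If the requested name (here "gzip" or "zstd") has no installed backend, the lookup fails exactly like the ValueError in the traceback.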
Expected behavior
Just the creation of the compressed shards. Instead, this error is raised:
Traceback (most recent call last):
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 426, in run
self._setup()
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 436, in _setup
self._create_cache()
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 511, in _create_cache
self.cache = Cache(
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/cache.py", line 65, in __init__
self._writer = BinaryWriter(
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/writer.py", line 85, in __init__
raise ValueError("No compresion algorithms are installed.")
ValueError: No compresion algorithms are installed.
Environment
Lightning AI Studio, with pip install litdata
Additional context
Check the size of the dataset (compressed vs. uncompressed); in my first implementation on AWS, I got the same size for the dataset.
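To check whether compression actually reduced the output, a small helper (hypothetical, not part of litdata) can sum the bytes under the optimize() output directory for a compressed and an uncompressed run:

```python
from pathlib import Path


def dir_size_bytes(root: str) -> int:
    """Total on-disk size of every file under `root`, e.g. an output_dir
    produced by optimize(). Compare runs with and without compression."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
```

If both runs report the same size, the shards were most likely written uncompressed despite the compression argument.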
Hi! Thanks for your contribution, great first issue!
Running pip install mosaicml-streaming resolves the error above; perhaps the relevant compression dependencies should be added to litdata.
Using zstd and then executing stream.py: data processing finishes (Finished data processing! ⚡), but reading fails:
⚡ ~ /home/zeus/miniconda3/envs/cloudspace/bin/python /teamspace/studios/this_studio/stream.py 8200
Traceback (most recent call last):
File "/teamspace/studios/this_studio/stream.py", line 19, in <module>
print(f'{dataset[0]}')
File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/dataset.py", line 244, in __getitem__
return self.cache[index]
File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/cache.py", line 132, in __getitem__
return self._reader.read(index)
File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/reader.py", line 252, in read
item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin)
File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 104, in load_item_from_chunk
return self.deserialize(data)
File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 116, in deserialize
return tree_unflatten(data, self._config["data_spec"])
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/_pytree.py", line 261, in tree_unflatten
raise ValueError(
ValueError: tree_unflatten(values, spec): `values` has length 0 but the spec refers to a pytree that holds 4 items (TreeSpec(tuple, None, [*,
*,
*,
*]))
Hey @rakro101, I published a new version. Can you try again ?
@tchaton it works now, but the file extension should be .zstd instead of .bin.
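The extension matters because the tree_unflatten failure above is consistent with the reader deserializing still-compressed bytes: if a compressed chunk is written under a plain .bin name, nothing triggers decompression and the deserializer sees zero values against a 4-item spec. A hedged sketch of extension-driven decompression, using stdlib zlib in place of zstd and a made-up naming scheme (not litdata's actual one):

```python
import zlib
from pathlib import Path


def write_chunk(path: Path, payload: bytes) -> None:
    """Compress iff the filename advertises a compressed format."""
    if path.name.endswith(".zlib.bin"):
        payload = zlib.compress(payload)
    path.write_bytes(payload)


def read_chunk(path: Path) -> bytes:
    """Decompress based on the extension. A mislabelled chunk (compressed
    bytes under a plain `.bin` name) is returned raw and fails downstream."""
    data = path.read_bytes()
    if path.name.endswith(".zlib.bin"):
        data = zlib.decompress(data)
    return data
```

With a correctly suffixed filename the round trip works; with compressed bytes under the wrong suffix, the reader hands raw compressed data to the deserializer, which is the failure mode reported above.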