
Compression using the optimize function from litdata

Open rakro101 opened this issue 1 year ago • 5 comments

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Use the Studio at https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming.

Modify it to use litdata instead of lightning.data.

Pass any compression method to the optimize function in convert.py, for example "zstd".

Code sample

import os

from litdata import optimize

optimize(
    convert_parquet_to_lightning_data,  # defined in convert.py
    parquet_files[:10],
    output_dir,
    num_workers=os.cpu_count(),
    chunk_bytes="64MB",
    compression="gzip",
)

Expected behavior

The compressed shards should just be created. Instead, this error is raised:

Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 426, in run
    self._setup()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 436, in _setup
    self._create_cache()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 511, in _create_cache
    self.cache = Cache(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/cache.py", line 65, in __init__
    self._writer = BinaryWriter(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/writer.py", line 85, in __init__
    raise ValueError("No compresion algorithms are installed.")
ValueError: No compresion algorithms are installed.
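
The error indicates litdata found no importable compression backend in the environment. A rough way to check which codecs are usable (the candidate list below is an assumption for illustration, not litdata's actual registry):

```python
import importlib.util

def available_compressions():
    # Map codec name -> Python module that provides it. The mapping is a
    # guess for illustration; litdata keeps its own registry internally.
    candidates = {"zstd": "zstd", "gzip": "zlib"}
    return [name for name, module in candidates.items()
            if importlib.util.find_spec(module) is not None]

print(available_compressions())  # zlib ships with CPython, so "gzip" appears
```

If "zstd" is missing from the result, installing the binding (for example via pip) should make it appear.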

Environment

Lightning AI Studio, with pip install litdata

Additional context

Also check the size of the dataset (compressed vs. uncompressed): in my first implementation on AWS, I got the same size for the dataset.

rakro101 avatar Apr 11 '24 07:04 rakro101
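
To compare the compressed and uncompressed outputs mentioned above, one can total the on-disk size of the chunk files in each output directory (a generic sketch, not a litdata API):

```python
import os

def dataset_size_bytes(output_dir: str) -> int:
    # Walk the output directory and total the size of every file in it,
    # so two optimize() runs (with and without compression) can be compared.
    total = 0
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

If compression is actually applied, the zstd/gzip run should report a noticeably smaller total than the uncompressed run.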

Hi! thanks for your contribution!, great first issue!

github-actions[bot] avatar Apr 11 '24 07:04 github-actions[bot]

Using pip install mosaicml-streaming resolves the error above; maybe some dependencies should be added to litdata.

rakro101 avatar Apr 11 '24 07:04 rakro101

Then, using zstd and executing stream.py, the optimization finishes ("Finished data processing! ⚡"), but reading the first item fails:

~ /home/zeus/miniconda3/envs/cloudspace/bin/python /teamspace/studios/this_studio/stream.py
8200

Traceback (most recent call last):
  File "/teamspace/studios/this_studio/stream.py", line 19, in <module>
    print(f'{dataset[0]}')
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/dataset.py", line 244, in __getitem__
    return self.cache[index]
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/cache.py", line 132, in __getitem__
    return self._reader.read(index)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 104, in load_item_from_chunk
    return self.deserialize(data)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 116, in deserialize
    return tree_unflatten(data, self._config["data_spec"])
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/_pytree.py", line 261, in tree_unflatten
    raise ValueError(
ValueError: tree_unflatten(values, spec): `values` has length 0 but the spec refers to a pytree that holds 4 items (TreeSpec(tuple, None, [*,
  *,
  *,
  *]))

rakro101 avatar Apr 11 '24 07:04 rakro101
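
The traceback suggests the item loader deserialized the chunk bytes without first undoing the compression, so it recovered zero leaves where the spec expected four. The invariant can be illustrated with a stdlib round-trip (zlib stands in for zstd here, and the four-integer byte layout is invented for illustration, not litdata's actual format):

```python
import struct
import zlib

# Write side: pack four integer "leaves", then compress the chunk.
payload = struct.pack("<4i", 1, 2, 3, 4)
chunk = zlib.compress(payload)

# Read side: the chunk must be decompressed before unpacking; feeding the
# still-compressed bytes to the deserializer would yield the wrong layout
# and no recoverable leaves, which matches the error above.
leaves = struct.unpack("<4i", zlib.decompress(chunk))
print(leaves)  # (1, 2, 3, 4)
```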

Hey @rakro101, I published a new version. Can you try again?

tchaton avatar Apr 11 '24 08:04 tchaton

@tchaton it works now, but the file extension should be .zstd instead of .bin.

rakro101 avatar Apr 22 '24 08:04 rakro101