`OutOfBoundsError` when streaming parquet files with `low_memory=True`
🐛 Bug
To Reproduce
I'm trying to stream my parquet dataset with the low_memory=True option, but I encounter an OutOfBoundsError. There is no problem when I set low_memory=False. The parquet files are compressed with zstd.
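For context, a minimal sketch of the streaming setup that triggers the error (the dataset path is a placeholder; the full code from my data module is further below):

import polars  # noqa: F401  (only needed to see the polars exception type)

from litdata import StreamingDataset, StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

# Placeholder path to an indexed folder of zstd-compressed parquet files
dataset = StreamingDataset(
    "data/split=train",
    item_loader=ParquetLoader(low_memory=True),  # low_memory=False works fine
)
dataloader = StreamingDataLoader(dataset, batch_size=8, num_workers=8)

for batch in dataloader:  # raises polars.exceptions.OutOfBoundsError
    pass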
Error
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__
[rank6]: return self.cache[index]
[rank6]: ~~~~~~~~~~^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__
[rank6]: return self._reader.read(index)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read
[rank6]: item = self._item_loader.load_item_from_chunk(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk
[rank6]: return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory
[rank6]: return row_group_df.row(row_index_within_group, named=True) # type: ignore
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row
[rank6]: row = self._df.row_tuple(index)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791
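The failing index equals the height of the row-group DataFrame, i.e. row_index_within_group points one past the last row of the group. A standalone illustration of the polars behavior (row-group size taken from the traceback):

import polars as pl

df = pl.DataFrame({"a": range(791)})  # a row group with 791 rows
df.row(790, named=True)  # OK: last valid index
df.row(791, named=True)  # raises polars.exceptions.OutOfBoundsError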
Code sample
from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

# Presumably methods of a LightningDataModule; TripleStreamingDataset is a
# custom StreamingDataset subclass defined elsewhere in the project.

def _load_dataset(self):
    # Indexed folder
    data_path = Path(self.data_path)
    # Low-memory parquet loader, shared across the splits
    item_loader = ParquetLoader(low_memory=True)
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }
    return dset

def setup(self, stage=None):
    # Load the three streaming datasets
    dset = self._load_dataset()
    # Train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )
Expected behavior
Streaming works without any problem with low_memory=True, as it does with low_memory=False.
Additional context
Environment detail
- PyTorch Version: 2.6.0
- OS: Ubuntu 20.04
- How you installed PyTorch: uv pip
- Build command you used: ...
- Python version: 3.11
- CUDA/cuDNN version: 12.4
- GPU models and configuration: A100 * 8
- Any other relevant information: Using Lightning + DDP for training
Hi! Thanks for your contribution, great first issue!
Hi @kyoungrok0517, thanks a lot for opening the issue!
To help us look into it, could you also share a bit more detail about the dataset? If it’s publicly available, a link to the Parquet file would be super helpful. Otherwise, any metadata or a sample schema from the file that's triggering the error could help us reproduce and debug it on our end.
Appreciate your help—thanks!
@bhimrazy Hello. Thanks for the support! Here's the schema and samples of my data. They are pairs of a query and a site, where each site is composed of many document embeddings. I've obfuscated the query and site for privacy. I'm using DistributedSampler and num_workers=8 in Lightning to read this data in DDP mode (n_gpu=8).
Schema([('query', String),
('site', String),
('rank', Int32),
('query_embedding', List(Float64)),
('site_embeddings', List(List(Float64))),
('attention_mask', List(Int32))])
{'query': shape: (1,)
Series: 'query' [str]
[
"NbrnTP3fAbnF"
],
'site': shape: (1,)
Series: 'site' [str]
[
"http://jeunovhpmf.com/i2t1py"
],
'rank': shape: (1,)
Series: 'rank' [i32]
[
12
],
'query_embedding': shape: (1,)
Series: 'query_embedding' [list[f64]]
[
[0.146101, 0.047645, … 0.054902]
],
'site_embeddings': shape: (1,)
Series: 'site_embeddings' [list[list[f64]]]
[
[[-0.0821, 0.0113, … 0.0124], [0.137, 0.0046, … 0.0979], … [-0.2549, -0.0328, … -0.3506]]
],
'attention_mask': shape: (1,)
Series: 'attention_mask' [list[i32]]
[
[1, 1, … 1]
]}
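For anyone trying to reproduce, here is a rough sketch that writes one zstd-compressed parquet row matching the schema above (the embedding dimension and documents-per-site count are placeholders, not the real values):

import random
import polars as pl

dim, n_docs = 768, 4  # placeholder embedding dimension and documents per site

df = pl.DataFrame(
    {
        "query": ["NbrnTP3fAbnF"],
        "site": ["http://jeunovhpmf.com/i2t1py"],
        "rank": pl.Series([12], dtype=pl.Int32),
        "query_embedding": [[random.random() for _ in range(dim)]],
        "site_embeddings": [[[random.random() for _ in range(dim)] for _ in range(n_docs)]],
        "attention_mask": pl.Series([[1] * n_docs], dtype=pl.List(pl.Int32)),
    }
)
df.write_parquet("sample.parquet", compression="zstd")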
Thank you for sharing the details, @kyoungrok0517! I'll give it a try with the sample dataset and see if I can reproduce the issue on my end.
@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.
@tchaton Thanks for the suggestion! I don't think it's related to compression: the file fully exists before reading, thanks to the atomic download mechanism.
Also, since low_memory=False works, that likely rules out compression issues.
I suspect it might be related to the number of rows within row groups differing across a chunk. Haven’t reproduced it with the sample yet, but I’ll run a few more tests with DDP to be sure.
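One quick way to check that hypothesis would be to dump the row-group sizes of the affected parquet files, e.g. with pyarrow (the path is a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile("split=train/part-0.parquet")  # placeholder path
for i in range(pf.metadata.num_row_groups):
    print(f"row group {i}: {pf.metadata.row_group(i).num_rows} rows")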
Update on the DDP Test:
I've run tests under DDP using the provided sample data, as well as other datasets like OpenThoughts-114k and fineweb-edu/sample/10BT. So far, I haven’t been able to reproduce the same bug.
The actual reason behind it is still unclear.
By the way, @kyoungrok0517 — would it be possible for you to create and share a Lightning Studio that reproduces the issue using your sample or any dataset where it appears? Thank you!
cc: @tchaton
@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.
Thank you, @kyoungrok0517 — really appreciate it!
Hi @kyoungrok0517 👋, just checking in to see if you've had a chance to create the Lightning Studio project that reproduces the issue. No rush at all, and please let us know if you've encountered anything new.
If you need help setting it up, this guide might be useful: https://www.youtube.com/watch?v=YcW-2Zt_bFg
Thanks again 🙏
Thanks again for reporting this issue, @kyoungrok0517 🙏
Since we haven’t been able to reproduce the error with the sample data and there hasn’t been further activity on a reproducible Lightning Studio project, I’ll go ahead and close this issue for now.
If you’re still running into the same problem or manage to create a minimal reproducible example, please feel free to re-open or open a new issue—we’ll be happy to take another look.
Thanks for your time and effort in helping improve litData! 🚀