
`OutOfBoundsError` when streaming parquet files with `low_memory=True`

Open kyoungrok0517 opened this issue 7 months ago • 9 comments

🐛 Bug

To Reproduce

I'm trying to stream my parquet dataset with the low_memory=True option, but I encounter OutOfBoundsError. There is no problem when I set low_memory=False. The parquet files are compressed with zstd.

Error
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__                                                                           
[rank6]:     return self.cache[index]                                                                                                                                                                                  
[rank6]:            ~~~~~~~~~~^^^^^^^                                                                                                                                                                                  
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__                                                                             
[rank6]:     return self._reader.read(index)                                                                                                                                                                           
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                           
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read                                                                                   
[rank6]:     item = self._item_loader.load_item_from_chunk(                                                                                                                                                            
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                            
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk                                                              
[rank6]:     return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)                                                                                                                        
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                        
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory                                                         
[rank6]:     return row_group_df.row(row_index_within_group, named=True)  # type: ignore                                                                                                                               
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row                                                                                    
[rank6]:     row = self._df.row_tuple(index)                                                                                                                                                                           
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                           
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791   
Code sample

# Context: these are methods of a LightningDataModule. The imports below are
# assumptions added for completeness (ParquetLoader is expected in
# litdata.streaming.item_loader, per the traceback); TripleStreamingDataset
# is the reporter's custom StreamingDataset subclass.
from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

def _load_dataset(self):
    # Indexed folder
    data_path = Path(self.data_path)
    # Low memory
    item_loader = ParquetLoader(low_memory=True)
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }

    return dset

def setup(self, stage=None):
    # load
    dset = self._load_dataset()

    # train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

Expected behavior

Streaming should work without errors, just as it does with low_memory=False.

Additional context

Environment detail
  • PyTorch Version: 2.6.0
  • OS: Ubuntu 20.04
  • How you installed PyTorch: uv pip
  • Build command you used: ...
  • Python version: 3.11
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: A100 * 8
  • Any other relevant information: Using Lightning + DDP for training

kyoungrok0517 avatar Apr 13 '25 16:04 kyoungrok0517

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Apr 13 '25 16:04 github-actions[bot]

Hi @kyoungrok0517, thanks a lot for opening the issue!

To help us look into it, could you also share a bit more detail about the dataset? If it’s publicly available, a link to the Parquet file would be super helpful. Otherwise, any metadata or a sample schema from the file that's triggering the error could help us reproduce and debug it on our end.

Appreciate your help—thanks!

bhimrazy avatar Apr 13 '25 16:04 bhimrazy

@bhimrazy Hello. Thanks for the support! Here are the schema and a sample row from my data. The rows are pairs of a query and a site, where each site is composed of many document embeddings. I've obfuscated the query and site values for privacy. I'm using DistributedSampler and num_workers=8 in Lightning to read this data in DDP mode (n_gpu=8).

Schema([('query', String),
        ('site', String),
        ('rank', Int32),
        ('query_embedding', List(Float64)),
        ('site_embeddings', List(List(Float64))),
        ('attention_mask', List(Int32))])
{'query': shape: (1,)
 Series: 'query' [str]
[
	"NbrnTP3fAbnF"
],
'site': shape: (1,)
Series: 'site' [str]
[
	"http://jeunovhpmf.com/i2t1py"
],
'rank': shape: (1,)
Series: 'rank' [i32]
[
	12
],
'query_embedding': shape: (1,)
Series: 'query_embedding' [list[f64]]
[
	[0.146101, 0.047645, … 0.054902]
],
'site_embeddings': shape: (1,)
Series: 'site_embeddings' [list[list[f64]]]
[
	[[-0.0821, 0.0113, … 0.0124], [0.137, 0.0046, … 0.0979], … [-0.2549, -0.0328, … -0.3506]]
],
'attention_mask': shape: (1,)
Series: 'attention_mask' [list[i32]]
[
	[1, 1, … 1]
]}

kyoungrok0517 avatar Apr 13 '25 22:04 kyoungrok0517

Thank you for sharing the details, @kyoungrok0517! I'll give it a try with the sample dataset and see if I can reproduce the issue on my end.

bhimrazy avatar Apr 14 '25 10:04 bhimrazy

@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.

tchaton avatar Apr 15 '25 09:04 tchaton

@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.

@tchaton Thanks for the suggestion! I don't think it's related to compression; the file fully exists before it's read, thanks to the atomic download mechanism.
Also, since low_memory=False works, that likely rules out compression issues.

I suspect it might be related to the number of rows differing across the row groups within a chunk. I haven't reproduced it with the sample yet, but I'll run a few more tests with DDP to be sure.
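For illustration only (this is a hypothetical sketch, not litdata's actual code): an off-by-one of exactly this shape appears when a flat row index is mapped into row groups under the assumption that every group has the same row count as the first, while the real groups are uneven:

```python
# Hypothetical sketch of the uneven-row-group hypothesis; none of this
# is litdata's real implementation.

def locate_assuming_uniform(index, group_sizes):
    """Map a flat row index to (group, offset), wrongly assuming every
    group is as large as the first one."""
    assumed = group_sizes[0]
    return index // assumed, index % assumed

def locate_correct(index, group_sizes):
    """Map a flat row index to (group, offset) using the real sizes."""
    for group, size in enumerate(group_sizes):
        if index < size:
            return group, index
        index -= size
    raise IndexError(index)

# Uneven middle group: 1000, 791, 1000 rows (2791 rows total).
sizes = [1000, 791, 1000]

# Row 1791 really sits at the start of the third group...
print(locate_correct(1791, sizes))           # (2, 0)

# ...but the uniform assumption maps it to offset 791 of the second
# group, whose length is only 791:
print(locate_assuming_uniform(1791, sizes))  # (1, 791)
# -> "index 791 is out of bounds for sequence of length 791"
```

If this hypothesis holds, checking whether the failing files have row groups of differing sizes would confirm it.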

bhimrazy avatar Apr 16 '25 06:04 bhimrazy

Update on the DDP Test:
I've run tests under DDP using the provided sample data, as well as other datasets like OpenThoughts-114k and fineweb-edu/sample/10BT. So far, I haven’t been able to reproduce the same bug.

The actual cause is still unclear.

By the way, @kyoungrok0517, would it be possible for you to create and share a Lightning Studio project that reproduces the issue, using your sample or any dataset where it appears? Thank you!

cc: @tchaton

bhimrazy avatar Apr 22 '25 07:04 bhimrazy

@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.

kyoungrok0517 avatar Apr 24 '25 02:04 kyoungrok0517

@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.

Thank you, @kyoungrok0517 — really appreciate it!

bhimrazy avatar Apr 24 '25 03:04 bhimrazy

Hi @kyoungrok0517 👋, just checking in to see whether you've had a chance to create the Lightning Studio project that reproduces the issue. No rush at all; just wanted to follow up in case you've made progress or encountered anything new.

If you need help setting it up, this guide might be useful: https://www.youtube.com/watch?v=YcW-2Zt_bFg

Thanks again 🙏

bhimrazy avatar Jun 03 '25 19:06 bhimrazy

Thanks again for reporting this issue, @kyoungrok0517 🙏

Since we haven’t been able to reproduce the error with the sample data and there hasn’t been further activity on a reproducible Lightning Studio project, I’ll go ahead and close this issue for now.

If you’re still running into the same problem or manage to create a minimal reproducible example, please feel free to re-open or open a new issue—we’ll be happy to take another look.

Thanks for your time and effort in helping improve litdata! 🚀

bhimrazy avatar Sep 02 '25 19:09 bhimrazy