
OSError: List index overflow.

Open adrienchaton opened this issue 3 years ago • 15 comments

Hello,

I am storing pandas DataFrames as .parquet files with pd.to_parquet and then trying to load them back with pd.read_parquet. I am running into an error for which I cannot find a solution and would kindly ask for help, please ...

Here is the trace:

  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.

If I store a small dataframe, I do not face this error. If I store a larger dataframe with e.g. 295,912,999 rows, then I get this error.

However, before saving it, I print the index range and it is bounded between 0 and 295912998. Whether I save the .parquet with index=True or index=False gives the same error, but I do not understand why there is an overflow on the bounded index ...

Any hints are much appreciated, thanks !

adrienchaton avatar Sep 24 '22 13:09 adrienchaton

Looks like the error message comes from the code below. https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64 It's strange, as the total of 295,912,999 rows is far from the int32 limit of 2,147,483,647.

I did a quick test (arrow-9.0) on a single-column dataframe with more rows than max int32. It works correctly.
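
For reference, a minimal sketch of what such a quick test could look like (the sizes and file name are my own assumptions, not the exact test run above; it needs several GB of RAM):

import numpy as np
import pandas as pd

# a single flat int8 column with more rows than max(int32)
n_rows = 2_200_000_000  # > 2,147,483,647
df = pd.DataFrame({"a": np.zeros(n_rows, dtype="int8")})
df.to_parquet("/tmp/flat_column.parquet")
# the comment above reports that this kind of read works correctly
pd.read_parquet("/tmp/flat_column.parquet")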

@adrienchaton Will you share the dataframe, or the generating steps?

cc @jorisvandenbossche for comments.

cyb70289 avatar Sep 28 '22 03:09 cyb70289

Thanks for looking into this. Unfortunately it's not possible to put together the generating steps of this dataframe in a self-contained script. But I could share the resulting dataframe that triggers the error when trying to read, although it's 14 GB ...

Some more observations which may help:

  • If I run the same code but, instead of saving the dataframe into a single parquet file, do a numpy array_split into e.g. 20 chunks saved separately (about 700 MB each), I can load these smaller chunks and concatenate them back (I was looking at a workaround).
  • The index datatype is int64.
  • If I only load one column, then I do not get the error.

I suspect that the "flattened" size (rows*columns) of the dataframe is probably too big then ?

adrienchaton avatar Sep 28 '22 10:09 adrienchaton

I suspect that the "flattened" size (rows*columns) of the dataframe is probably too big then ?

In theory that shouldn't matter (only the size of individual columns).

What data types are in the file?

Whether I save the .parquet with index=True or index=False gives the same error, but I do not understand why there is an overflow on the bounded index ...

What do you mean exactly with "bounded index" here? Does it matter what index you have to trigger the error?

jorisvandenbossche avatar Sep 28 '22 12:09 jorisvandenbossche

The column dtypes are Int64(1), UInt16(15), UInt64(1), bool(11), boolean(6), object(15); the object columns are strings.

What I meant by "bounded index" is that if I print the dataframe index before storing it to parquet, I get no overflow error (just a proper integer range); however, if I try to load this dataframe from the parquet file, I get an overflow error.

I believe you when you say that the number of columns shouldn't matter, but somehow it is strange that I do not get an overflow error if I only load a single-column subset of the dataframe ...

In the meantime it's not that bad to use the trick of chunking the dataframe into several .parquet files and concatenating them back when loading ... it just bugs me (and I used to store dataframes with hundreds of millions of rows into single parquet files without this error popping up).

adrienchaton avatar Sep 29 '22 08:09 adrienchaton

I believe you when you say that the number of columns shouldn't matter, but somehow it is strange that I do not get an overflow error if I only load a single-column subset of the dataframe ...

Does loading a single-column subset work only for some columns, or does it work for all columns? (You could maybe test that with a loop.) There might be one column that is specifically triggering the error.
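
One possible way to run that per-column check (a sketch only; the file name is hypothetical):

import pandas as pd
import pyarrow.parquet as pq

path = "data.parquet"  # hypothetical file name
schema = pq.read_schema(path)
for name in schema.names:
    try:
        # read just this one column and see whether it triggers the error
        pd.read_parquet(path, columns=[name])
        print(name, "OK")
    except OSError as exc:
        print(name, "FAILED:", exc)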

But I could share the resulting dataframe that triggers the error when trying to read, although its 14GB large ...

That's indeed a bit large. Could you first try to see if you still get the error when reducing the file size a bit? (for example, if you only save half of the columns in the file, do you still have the issue on read? Or if you take only 50 or 75% of the number of rows, do you still have the issue?)

For creating a script to generate the data instead, one possible approach that might reproduce the issue is to take a tiny sample of the data, save that, and then see if a script that generates a large file from that sample still reproduces it. For example:

import pandas as pd

subset = pd.read_parquet("data_subset.parquet")
# use whatever repetition factor is needed to reach the data size that reproduces the issue
df = pd.concat([subset] * 1000, ignore_index=True)
df.to_parquet("data.parquet")
# does this still error?
pd.read_parquet("data.parquet")

jorisvandenbossche avatar Sep 29 '22 09:09 jorisvandenbossche

Thanks for the tips. I tried loading the large dataframe column by column and found two columns that caused the error. These columns are the only ones that themselves contain lists of integers (one list per row).

One observation: if I load these columns from the smaller dataframes (e.g. those I got after chunking the large dataframe into 20 files), I do not get the error. But if I try to load these columns from the single large parquet file, then I get the error.

This explains why I started to see this error that I never saw before. I gathered more metadata in my dataframes, among which was some metadata formatted as lists of integers, and that is when I started triggering this error.

In the end, it seems like if I store a ~300M-row dataframe into a single parquet file, I do not get this reading error as long as all columns contain a single element per row (e.g. a string, an integer, a float and so on). But if I add a column which contains a list of ~200 integers per row, I trigger the error.

Does that make sense to you ?

adrienchaton avatar Sep 30 '22 14:09 adrienchaton

Managed to reproduce this error from a dataset with a single column containing a list of integers.

  • to generate the dataset
import numpy as np
import pandas as pd

# total rows < max(int32)
n_rows = 108000000

# dataframe has only one column containing a list of 200 integers
# 200 * n_rows > max(int32)
data = [np.zeros(200, dtype='int8')] * n_rows

print('generating...')
df = pd.DataFrame()
# only one column
df['a'] = data

print('saving ...')
df.to_parquet('/tmp/pq')
print('done')
  • to load the dataset
import pandas as pd

print('loading...')
df = pd.read_parquet('/tmp/pq', use_threads=False)
print('size = {}'.format(df.shape))

Tested with pyarrow-9.0.0 and pandas-1.5. Loading the dataset failed with OSError: List index overflow.

NOTE: loading the dataset leads to an "out of memory" kill on a machine with 128 GB RAM. I have to test it on a 256 GB RAM machine.

cyb70289 avatar Oct 08 '22 04:10 cyb70289

Filed a JIRA issue: https://issues.apache.org/jira/browse/ARROW-17983

cyb70289 avatar Oct 11 '22 05:10 cyb70289

I commented on the JIRA about longer-term fixes. I think if you want to save all of this data as one Parquet file, the path that should round-trip is to control the conversion from pandas to Arrow yourself and use the LargeList type instead of List for columns that contain lists. The Arrow schema is persisted in the Parquet file and is then used to read it back, which allows for 64-bit instead of 32-bit indices.

It would be nice to be able to set Large types via a flag, or to infer that large types are necessary based on Parquet statistics (I noted these in the JIRA).
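
A minimal sketch of that LargeList path, assuming a small stand-in dataframe with one list column (names and sizes are illustrative, not the reporter's data):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# stand-in dataframe with a single list-of-int8 column
df = pd.DataFrame({"a": [np.zeros(200, dtype="int8")] * 1000})

# declare the column as large_list so Arrow uses 64-bit offsets
schema = pa.schema([("a", pa.large_list(pa.int8()))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "data_large_list.parquet")

# the Arrow schema stored in the file is used on read, restoring large_list
print(pq.read_table("data_large_list.parquet").schema)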

emkornfield avatar Oct 18 '22 05:10 emkornfield

Another thing that could help is if we could break the dataframe across row groups (I'm surprised that for a dataframe of this size we don't have defaults in place that would already do that).
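
For example, writing with an explicit row_group_size is one way to get smaller row groups today (a sketch with stand-in data; whether this avoids the read-side overflow is not verified here):

import numpy as np
import pandas as pd

# stand-in data; in the real case, pick a row group size so that
# rows_per_group * list_length stays well below max(int32)
df = pd.DataFrame({"a": [np.zeros(200, dtype="int8")] * 1000})
df.to_parquet("data_row_groups.parquet", row_group_size=100)  # kwarg is passed through to pyarrow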

emkornfield avatar Oct 18 '22 05:10 emkornfield

Another thing that could help is if we could break the dataframe across row groups (I'm surprised that for a dataframe of this size we don't have defaults in place that would already do that).

I don't fully understand how the dataframe is involved here. If I read the above correctly, it is the reading of a Parquet file into an Arrow table that is failing? (and not the conversion of the dataframe -> pyarrow table (for writing), or after reading the conversion of pyarrow table -> dataframe)

When converting a large dataframe like this, I think we automatically use chunked array to be able to represent this in a ListType? But when reading from Parquet, I would assume we also use chunks per record batch?

jorisvandenbossche avatar Oct 18 '22 12:10 jorisvandenbossche

I tried to reproduce this using a smaller example (but just large enough to not fit in a single ListArray), so I could test this on my laptop with limited memory:

import numpy as np
import pandas as pd
import pyarrow as pa

n_rows = 12000000
data = [np.zeros(200, dtype='int8')] * n_rows
df = pd.DataFrame({'a': data})
table = pa.table(df)

>>> table.schema
a: list<item: int8>
  child 0, item: int8
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 389

>>> table["a"].num_chunks   # <--- needed to use 2 chunks to fit all data in a ListType
2

import pyarrow.parquet as pq
pq.write_table(table, "test_large_list.parquet")

But the above hangs at the write_table call: after first taking up a lot of memory and CPU, at some point it stops doing anything (no significant CPU usage anymore), yet the file is also not written (only a 4 kB file), and trying to kill the process with Ctrl-C doesn't work either.

jorisvandenbossche avatar Oct 18 '22 12:10 jorisvandenbossche

Maybe you need more patience :-)

Attached is a dataset created from the code below. The parquet file is about 15 MB; after zipping it's only 60 KB. I saw RSS jump above 30 GB while loading the file, not sure if a laptop can read it.

test.parquet.zip

n_rows = 10800000
data = [np.zeros(200, dtype='bool')] * n_rows

cyb70289 avatar Oct 18 '22 14:10 cyb70289

Maybe you need more patience :-)

That was indeed the case ;) I did it again while watching the process, and it actually kept doing something, until it finished after 10 minutes having created a file of 2.9 MB ... (This seems a very long time for writing a dataset of this size (2 GB in memory), but it's also a very special case in that it's all the same numbers and thus encodes/compresses very well.)

But after managing to write the file, reading the file still killed the process, so I assume I don't have enough memory to test this.

jorisvandenbossche avatar Oct 18 '22 14:10 jorisvandenbossche

I don't fully understand how the dataframe is involved here. If I read the above correctly, it is the reading of a Parquet file into an Arrow table that is failing? (and not the conversion of the dataframe -> pyarrow table (for writing), or after reading the conversion of pyarrow table -> dataframe)

This is my understanding as well.

When converting a large dataframe like this, I think we automatically use chunked array to be able to represent this in a ListType? But when reading from Parquet, I would assume we also use chunks per record batch?

Yes, I wasn't thinking clearly. One possible conclusion is that we aren't doing chunking when reading from parquet -> arrow -> pandas? Is that possible?

emkornfield avatar Oct 18 '22 17:10 emkornfield

FWIW, I've had success using polars to read large parquet files containing columns of lists of float64s and converting them to pandas afterwards, when pd.read_parquet caused an error.

aschmu avatar Feb 13 '23 10:02 aschmu

Any update on this issue? The workaround via Polars works, but it's extremely slow.

tecamenz avatar May 31 '23 09:05 tecamenz

Same issue, any updates?

paulacanva avatar Oct 26 '23 03:10 paulacanva

I think no one is working on this. Contributions are welcome.

cyb70289 avatar Oct 26 '23 04:10 cyb70289

Same issue, fix is needed

oasidorshin avatar Nov 09 '23 12:11 oasidorshin

Problem

Same here. I'm trying to load a 6 GB parquet file with 3 columns (two string columns and one with embeddings, arrays just like @aschmu described) in pandas with

 df = pd.read_parquet("test.parquet")
File size in bytes: 6207538015 bytes
File size in kilobytes: 6062048.84 KB

Tried with Python 3.11 and pandas 2.0.3 and the latest 2.1.3, on Windows (32 GB RAM) and Ubuntu (128 GB RAM):

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File <timed exec>:4

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parquet.py:509, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, **kwargs)
    506     use_nullable_dtypes = False
    507 check_dtype_backend(dtype_backend)
--> 509 return impl.read(
    510     path,
    511     columns=columns,
    512     storage_options=storage_options,
    513     use_nullable_dtypes=use_nullable_dtypes,
    514     dtype_backend=dtype_backend,
    515     **kwargs,
    516 )

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parquet.py:227, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, dtype_backend, storage_options, **kwargs)
    220 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    221     path,
    222     kwargs.pop("filesystem", None),
    223     storage_options=storage_options,
    224     mode="rb",
    225 )
    226 try:
--> 227     pa_table = self.api.parquet.read_table(
    228         path_or_handle, columns=columns, **kwargs
    229     )
    230     result = pa_table.to_pandas(**to_pandas_kwargs)
    232     if manager == "array":

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py:2973, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2962         # TODO test that source is not a directory or a list
   2963         dataset = ParquetFile(
   2964             source, metadata=metadata, read_dictionary=read_dictionary,
   2965             memory_map=memory_map, buffer_size=buffer_size,
   (...)
   2970             thrift_container_size_limit=thrift_container_size_limit,
   2971         )
-> 2973     return dataset.read(columns=columns, use_threads=use_threads,
   2974                         use_pandas_metadata=use_pandas_metadata)
   2976 warnings.warn(
   2977     "Passing 'use_legacy_dataset=True' to get the legacy behaviour is "
   2978     "deprecated as of pyarrow 8.0.0, and the legacy implementation will "
   2979     "be removed in a future version.",
   2980     FutureWarning, stacklevel=2)
   2982 if ignore_prefixes is not None:

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py:2601, in _ParquetDatasetV2.read(self, columns, use_threads, use_pandas_metadata)
   2593         index_columns = [
   2594             col for col in _get_pandas_index_columns(metadata)
   2595             if not isinstance(col, dict)
   2596         ]
   2597         columns = (
   2598             list(columns) + list(set(index_columns) - set(columns))
   2599         )
-> 2601 table = self._dataset.to_table(
   2602     columns=columns, filter=self._filter_expression,
   2603     use_threads=use_threads
   2604 )
   2606 # if use_pandas_metadata, restore the pandas metadata (which gets
   2607 # lost if doing a specific `columns` selection in to_table)
   2608 if use_pandas_metadata:

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()

OSError: List index overflow.

The weird thing is that I processed 20 of these files with different file sizes, and even bigger ones than this one (7 GB) worked.

Workaround

UPDATE: Thanks a lot @aschmu for pointing in the right direction. I used polars too and converted to pandas, so I can just continue with my normal workflow :)

This worked like a charm for my big files!

# pip install polars 
import polars as pl
df = pl.read_parquet("test.parquet")
df = df.to_pandas()
# del df["__index_level_0__"]  # if needed, drop the pandas index column that polars loads as a regular column

do-me avatar Nov 25 '23 09:11 do-me

Would iter_batches() be OK as a workaround?
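
For context, a sketch of the kind of batched read being asked about (the path is hypothetical; note that the reply below reports it still hit the same error):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # hypothetical path
for batch in pf.iter_batches(batch_size=1_000_000):
    chunk_df = batch.to_pandas()  # handle each chunk instead of one huge table
    # ... process chunk_df here ...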

mapleFU avatar Nov 25 '23 11:11 mapleFU

Would iter_batches() be OK as a workaround?

@mapleFU I don't think so. I was using IterableDataset from Hugging Face, which calls iter_batches(), and got a similar error.

  File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1385, in __iter__
    for key, pa_table in iterator:
  File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 167, in _batch_arrow_tables
    for key, pa_table in iterable:
  File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 289, in _iter_arrow
    yield from self.generate_tables_fn(**self.kwargs)
  File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/packaged_modules/parquet/parquet.py", line 90, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: List index overflow.

danielz02 avatar May 22 '24 01:05 danielz02

The number of elements in one piece of data should be less than max(int32); that is the upper limit for one parquet file. Try splitting the data into pieces by row so that you can save them as multiple parquet files, as in the sketch below.
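
A minimal sketch of that row-wise split and re-read, in the spirit of the array_split workaround mentioned earlier in the thread (stand-in data; the chunk count is illustrative and should be chosen so each piece stays well below the limit):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.zeros(200, dtype="int8")] * 1000})  # stand-in data

# write each row-wise chunk to its own parquet file
n_chunks = 4
for i, chunk in enumerate(np.array_split(df, n_chunks)):
    chunk.to_parquet(f"data_part_{i:02d}.parquet", index=False)

# read the pieces back and concatenate
df_back = pd.concat(
    (pd.read_parquet(f"data_part_{i:02d}.parquet") for i in range(n_chunks)),
    ignore_index=True,
)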

janelu9 avatar Jul 19 '24 12:07 janelu9