OSError: List index overflow.
Hello,
I am storing a pandas dataframe as .parquet with pd.to_parquet and then trying to load it back with pd.read_parquet. I am running into an error for which I cannot find a solution, and would kindly ask for help with this please ...
Here is the trace:
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
return impl.read(
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.
If I store a small dataframe, I do not face this error. If I store a larger dataframe with e.g. 295,912,999 rows, then I get this error.
However, before saving it I print the index range and it is bounded between 0 and 295912998. Whether I save the .parquet with index=True or False, I get the same error, and I do not understand why there is an overflow on a bounded index ...
Any hints are much appreciated, thanks!
Looks like the error message comes from the code below. https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64 It's strange, as the total row count 295,912,999 is far below the int32 limit of 2,147,483,647.
I did a quick test (arrow-9.0) on a single-column dataframe with more rows than max int32. It works correctly.
@adrienchaton Will you share the dataframe, or the generating steps?
cc @jorisvandenbossche for comments.
Thanks for looking into this. Unfortunately it's not possible to put together the generating steps of this dataframe in a self-contained script. But I could share the resulting dataframe that triggers the error when trying to read, although it's 14 GB large ...
Some more observations which may help:
- if I run the same code but, instead of saving the dataframe into a single parquet file, do a numpy array_split into e.g. 20 chunks saved separately (about 700 MB each), I can load these smaller chunks and concatenate them back (I was looking for a workaround);
- the index datatype is int64;
- if I only load one column, then I do not get the error.
I suspect that the "flattened" size (rows*columns) of the dataframe is probably too big then?
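For reference, a minimal sketch of the chunking workaround mentioned in the observations above (the file names and the small stand-in dataframe are made up for illustration):
import numpy as np
import pandas as pd
# stand-in for the real large dataframe
df = pd.DataFrame({'a': range(1000)})
# split the dataframe into e.g. 20 pieces and save each piece as its own parquet file
chunks = np.array_split(df, 20)
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f'data_part_{i:02d}.parquet')
# later, load the pieces and concatenate them back into one dataframe
df_back = pd.concat(
    [pd.read_parquet(f'data_part_{i:02d}.parquet') for i in range(20)],
    ignore_index=True,
)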
I suspect that the "flattened" size (rows*columns) of the dataframe is probably too big then?
In theory that shouldn't matter (only the size of individual columns).
What data types are in the file?
Whether I save the .parquet with index=True or False, I get the same error, and I do not understand why there is an overflow on a bounded index ...
What do you mean exactly with "bounded index" here? Does it matter what index you have to trigger the error?
The column dtypes are Int64(1), UInt16(15), UInt64(1), bool(11), boolean(6), object(15); the object columns are strings.
What I meant by bounded index is that if I print the dataframe index before storing it to parquet, I get no overflow error (just a proper integer range). However, if I try to load this dataframe from the parquet file, I get an overflow error.
I believe you when you say that the number of columns shouldn't matter, but somehow it is strange that I do not get an overflow error if I only load a single-column subset of the dataframe ...
In the meantime it's not that bad to use the trick of chunking the dataframe into several .parquet files and concatenating them back when loading ... it just bugs me (and I used to store dataframes with hundreds of millions of rows in single parquet files without this error popping up).
I believe you when you say that the number of columns shouldn't matter, but somehow it is strange that I do not get an overflow error if I only load a single-column subset of the dataframe ...
Loading a single column subset works for some columns, or does it work for all columns? (you could maybe test that with a loop) There might be one of the columns that is specifically triggering the error.
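A minimal sketch of such a column-by-column check, assuming the data sits in a single file called data.parquet (hypothetical name):
import pandas as pd
import pyarrow.parquet as pq
path = 'data.parquet'  # hypothetical file name
# read only the schema, so we get the column names without loading any data
columns = pq.read_schema(path).names
# try each column on its own to see which one(s) trigger the error
for col in columns:
    try:
        pd.read_parquet(path, columns=[col])
        print(col, 'ok')
    except OSError as exc:
        print(col, 'failed:', exc)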
But I could share the resulting dataframe that triggers the error when trying to read, although it's 14 GB large ...
That's indeed a bit large. Could you first try to see if you still get the error when reducing the file size a bit? (for example, if you only save half of the columns in the file, do you still have the issue on read? Or if you take only 50 or 75% of the number of rows, do you still have the issue?)
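For example, a quick sketch of those two reductions (the small stand-in dataframe and file names are made up):
import pandas as pd
# stand-in for the original large dataframe
df = pd.DataFrame({'a': range(10), 'b': range(10), 'c': range(10), 'd': range(10)})
# keep only half of the columns
df[df.columns[: len(df.columns) // 2]].to_parquet('data_half_cols.parquet')
# or keep only the first 50% of the rows
df.iloc[: len(df) // 2].to_parquet('data_half_rows.parquet')
# does reading either of these still error?
pd.read_parquet('data_half_cols.parquet')
pd.read_parquet('data_half_rows.parquet')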
For creating a script to generate the data instead, one possible approach that might reproduce the issue can be to get a tiny sample of the data, save that, and then see if a script that generates a large file from that sample still reproduces it. For example, like:
import pandas as pd

subset = pd.read_parquet("data_subset.parquet")
# use the repetition count needed here to get to a data size that reproduces the issue
df = pd.concat([subset] * 1000, ignore_index=True)
df.to_parquet("data.parquet")
# does this still error?
pd.read_parquet("data.parquet")
Thanks for the tips. I tried loading the large dataframe column by column and found two columns that cause the error. These columns are the only ones that themselves contain lists of integers (per row).
One observation: if I load these columns from the smaller dataframes (e.g. those I got after chunking the large dataframe into 20 files), I do not get the error. But if I try to load these columns from the large dataframe storage, then I get the error.
This explains why I started to see this error that I never saw before. I gathered more metadata in my dataframes, among which were some metadata formatted as lists of integers, and that is what triggers this error.
In the end, it seems like if I store a ~300M-row dataframe into a single parquet file, I do not get this reading error as long as all columns contain a single element per row (e.g. a string, an integer, a float and so on). But if I add a column which contains a list of ~200 integers per row, I trigger the error.
Does that make sense to you?
Managed to reproduce this error from a dataset with a single column containing a list of integers.
- to generate the dataset
import numpy as np
import pandas as pd
# total rows < max(int32)
n_rows = 108000000
# dataframe has only one column containing a list of 200 integers
# 200 * n_rows > max(int32)
data = [np.zeros(200, dtype='int8')] * n_rows
print('generating...')
df = pd.DataFrame()
# only one column
df['a'] = data
print('saving ...')
df.to_parquet('/tmp/pq')
print('done')
- to load the dataset
import pandas as pd
print('loading...')
df = pd.read_parquet('/tmp/pq', use_threads=False)
print('size = {}'.format(df.shape))
Tested with pyarrow-9.0.0 and pandas-1.5. Loading the dataset failed with OSError: List index overflow.
NOTE: loading the dataset leads to an "out of memory" kill on a machine with 128G RAM. I have to test it on a 256G RAM machine.
Filed a JIRA issue: https://issues.apache.org/jira/browse/ARROW-17983
I commented on the JIRA about longer-term fixes. I think if you want to save all of this data as one parquet file, the path that should round-trip is to control the conversion from Pandas to Arrow yourself and use the LargeList type instead of List for columns that contain lists. The Arrow schema is persisted in the Parquet file and is then used to read it back, which allows for 64-bit instead of 32-bit indices.
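A minimal sketch of that explicit conversion, assuming a list-of-int8 column like in the repro above (the column name, file name, and small stand-in data are just examples):
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# small stand-in for the real data: one column holding a list of integers per row
df = pd.DataFrame({'a': [np.zeros(200, dtype='int8')] * 1000})
# convert with an explicit schema so the list column gets 64-bit (large_list) offsets
schema = pa.schema([('a', pa.large_list(pa.int8()))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'data_large_list.parquet')
# the Arrow schema stored in the file is used on read, so the column
# comes back as large_list (64-bit offsets) instead of list (32-bit offsets)
df_back = pq.read_table('data_large_list.parquet').to_pandas()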
It would be nice to be able to set large types via a flag, or to infer that large types are necessary based on Parquet statistics (I noted these in the JIRA).
Another thing that could help is if we could break the data-frame across row-groups (I'm surprised for this size dataframe we don't have defaults in place that would already do that).
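For the row-group idea, a sketch of writing smaller row groups and reading them back one by one (the stand-in table, file name, and row-group size are arbitrary, and whether this alone avoids the overflow is an experiment rather than a guaranteed fix):
import pyarrow as pa
import pyarrow.parquet as pq
# small stand-in table; for the real data this would be the converted dataframe
table = pa.table({'a': [[0] * 200] * 1000})
# write smaller row groups so each row group holds fewer list elements
pq.write_table(table, 'data_row_groups.parquet', row_group_size=100)
# reading row group by row group avoids materializing one huge column at once
pf = pq.ParquetFile('data_row_groups.parquet')
pieces = [pf.read_row_group(i) for i in range(pf.metadata.num_row_groups)]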
Another thing that could help is if we could break the data-frame across row-groups (I'm surprised for this size dataframe we don't have defaults in place that would already do that).
I don't fully understand how the dataframe is involved here. If I read the above correctly, it is the reading of a Parquet file into an Arrow table that is failing? (and not the conversion of the dataframe -> pyarrow table (for writing), or after reading the conversion of pyarrow table -> dataframe)
When converting a large dataframe like this, I think we automatically use chunked array to be able to represent this in a ListType? But when reading from Parquet, I would assume we also use chunks per record batch?
I tried to reproduce this using a smaller example (but just large enough to not fit in a single ListArray), so I could test this on my laptop with limited memory:
import numpy as np
import pandas as pd
import pyarrow as pa

n_rows = 12000000
data = [np.zeros(200, dtype='int8')] * n_rows
df = pd.DataFrame({'a': data})
table = pa.table(df)
>>> table.schema
a: list<item: int8>
child 0, item: int8
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 389
>>> table["a"].num_chunks # <--- needed to use 2 chunks to fit all data in a ListType
2
import pyarrow.parquet as pq
pq.write_table(table, "test_large_list.parquet")
But the above hangs at the write_table command: after first taking up a lot of memory and CPU, at some point it stops doing anything (no significant CPU usage anymore), yet the file is also not written (only a 4 kB file), and trying to kill the process with Ctrl-C doesn't work either.
Maybe you need more patience :-)
Attached a dataset created from the code below. The parquet file is about 15M; after zipping it's only 60K. I saw RSS jump to above 30G while loading the file, not sure if a laptop can read it.
n_rows = 10800000
data = [np.zeros(200, dtype='bool')] * n_rows
Maybe you need more patience :-)
That was indeed the case ;) I did it again while watching the process, and it actually kept doing something, until it finished after 10 min having created a file of 2.9 MB ... (this seems a very long time for writing a dataset of this size (2GB in memory), but it's also a very special case in that it's all the same numbers and thus encodes/compresses very well)
But after managing to write the file, reading the file still killed the process, so I assume I don't have enough memory to test this.
I don't fully understand how the dataframe is involved here. If I read the above correctly, it is the reading of a Parquet file into an Arrow table that is failing? (and not the conversion of the dataframe -> pyarrow table (for writing), or after reading the conversion of pyarrow table -> dataframe)
This is my understanding as well.
When converting a large dataframe like this, I think we automatically use chunked array to be able to represent this in a ListType? But when reading from Parquet, I would assume we also use chunks per record batch?
Yes, I wasn't thinking clearly. One possible conclusion is that we aren't doing chunking when reading from parquet->arrow->pandas? Is that possible?
FWIW I've had success using polars to read large parquet files containing columns of lists of float64 and converting them to pandas afterwards, when pd.read_parquet caused an error.
Any update on this issue? The workaround via Polars works but it's extremely slow.
Same issue, any updates?
I think no one is working on this. Contributions are welcome.
Same issue, fix is needed
Problem
Same here. I'm trying to load a 6GB parquet file with 3 columns (two string columns and one with embeddings, i.e. arrays just like @aschmu described) in pandas with
df = pd.read_parquet("test.parquet")
File size in bytes: 6207538015 bytes
File size in kilobytes: 6062048.84 KB
Tried with Python 3.11 and pandas 2.0.3 and latest 2.1.3 on Windows (32 GB RAM) and Ubuntu (128 GB RAM):
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File <timed exec>:4
File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parquet.py:509, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, **kwargs)
506 use_nullable_dtypes = False
507 check_dtype_backend(dtype_backend)
--> 509 return impl.read(
510 path,
511 columns=columns,
512 storage_options=storage_options,
513 use_nullable_dtypes=use_nullable_dtypes,
514 dtype_backend=dtype_backend,
515 **kwargs,
516 )
File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parquet.py:227, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, dtype_backend, storage_options, **kwargs)
220 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
221 path,
222 kwargs.pop("filesystem", None),
223 storage_options=storage_options,
224 mode="rb",
225 )
226 try:
--> 227 pa_table = self.api.parquet.read_table(
228 path_or_handle, columns=columns, **kwargs
229 )
230 result = pa_table.to_pandas(**to_pandas_kwargs)
232 if manager == "array":
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py:2973, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
2962 # TODO test that source is not a directory or a list
2963 dataset = ParquetFile(
2964 source, metadata=metadata, read_dictionary=read_dictionary,
2965 memory_map=memory_map, buffer_size=buffer_size,
(...)
2970 thrift_container_size_limit=thrift_container_size_limit,
2971 )
-> 2973 return dataset.read(columns=columns, use_threads=use_threads,
2974 use_pandas_metadata=use_pandas_metadata)
2976 warnings.warn(
2977 "Passing 'use_legacy_dataset=True' to get the legacy behaviour is "
2978 "deprecated as of pyarrow 8.0.0, and the legacy implementation will "
2979 "be removed in a future version.",
2980 FutureWarning, stacklevel=2)
2982 if ignore_prefixes is not None:
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py:2601, in _ParquetDatasetV2.read(self, columns, use_threads, use_pandas_metadata)
2593 index_columns = [
2594 col for col in _get_pandas_index_columns(metadata)
2595 if not isinstance(col, dict)
2596 ]
2597 columns = (
2598 list(columns) + list(set(index_columns) - set(columns))
2599 )
-> 2601 table = self._dataset.to_table(
2602 columns=columns, filter=self._filter_expression,
2603 use_threads=use_threads
2604 )
2606 # if use_pandas_metadata, restore the pandas metadata (which gets
2607 # lost if doing a specific `columns` selection in to_table)
2608 if use_pandas_metadata:
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
OSError: List index overflow.
The weird thing is that I processed 20 of these files with different file sizes, and even bigger ones than this one (7GB) worked.
Workaround
UPDATE: Thanks a lot @aschmu for pointing in the right direction. I used polars too and converted to pandas so I can just continue with my normal workflow :)
This worked like a charm for my big files!
# pip install polars
import polars as pl
df = pl.read_parquet("test.parquet")
df = df.to_pandas()
# del df["__index_level_0__"]  # if needed, delete the preserved pandas index column
Would iter_batches() work as a workaround?
Would iter_batches() work as a workaround?
@mapleFU I don't think so. I was using IterableDataset from Hugging Face, which calls iter_batches() and got a similar error.
File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1385, in __iter__
for key, pa_table in iterator:
File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 167, in _batch_arrow_tables
for key, pa_table in iterable:
File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 289, in _iter_arrow
yield from self.generate_tables_fn(**self.kwargs)
File "/home/xxxx/miniforge3/envs/geospatial/lib/python3.11/site-packages/datasets/packaged_modules/parquet/parquet.py", line 90, in _generate_tables
for batch_idx, record_batch in enumerate(
File "pyarrow/_parquet.pyx", line 1587, in iter_batches
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: List index overflow.
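For reference, a direct iter_batches attempt would look like the sketch below (file name hypothetical); as the traceback above shows, it reportedly still hits the same error on the offending files.
import pyarrow.parquet as pq
pf = pq.ParquetFile('data.parquet')  # hypothetical file name
# read the file in small record batches instead of one big table
for batch in pf.iter_batches(batch_size=65536):
    df_chunk = batch.to_pandas()
    # ... process each chunk ...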
The number of elements in one piece of data should be less than max(int32); that is the upper limit for one parquet file. Try splitting the data into pieces by row so that you can save them across multiple parquet files.