polars
read_parquet stops execution without error
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
When loading a subset of columns with pl.read_parquet by passing a list of columns to the columns argument, the execution
- stops without an error message after ~1 min when use_pyarrow=False (the default, using the native Rust parquet reader)
- succeeds when use_pyarrow=True (using the PyArrow reader)
The list contains only a single column, which holds strings of variable length (1, 40, 72, or 79 characters). Interestingly, importing the very same parquet file with use_pyarrow=False works when loading all columns.
Any ideas why this could be the case?
Reproducible example
The problem is caused by a dataset I cannot share.
Expected behavior
Properly load the single column into a pl.DataFrame.
Installed versions
--------Version info---------
Polars: 0.17.12
Index type: UInt32
Platform: macOS-13.2.1-arm64-arm-64bit
Python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:12:31) [Clang 14.0.6 ]
----Optional dependencies----
numpy: 1.23.5
pandas: 2.0.1
pyarrow: 9.0.0
connectorx: 0.3.0
deltalake: <not installed>
fsspec: 2023.4.0
matplotlib: 3.7.1
xlsx2csv: <not installed>
xlsxwriter: <not installed>
I think this is a duplicate of: #7237
I'm not sure, I don't have an "infinite recursion error" in my case. What would the solution be if they are duplicates?
This doesn't run in Python or any other runtime that checks for recursion overflow. The recursion overflows the stack, and the process is then killed with a segfault.
Can you share the file?
Sorry, I cannot share the file due to data protection reasons. But the overall number of characters is ~7.5 billion (~133 million rows with ~56 characters each), so it gets quite close to the 10 GB limit.
Yes, that's the same problem. Could you check how many row groups and data pages are in the file? Was the file written with polars?
The file was written with polars, yes.
How can I get the number of row groups and data pages?
Could you try writing the file with a row group size of 500k?
Closing due to the lack of a reproducible example.