read_parquet stops execution without error

Open MariusMerkleQC opened this issue 1 year ago • 8 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

When loading a subset of columns with pl.read_parquet by passing a list to the columns argument, the execution

  • stops without an error message after ~1 min when use_pyarrow=False (the default, which uses the native Rust parquet reader)
  • succeeds when use_pyarrow=True (which reads the file through PyArrow)

Only a single column is requested; it contains strings of variable length (1, 40, 72, or 79 characters). Interestingly, with use_pyarrow=False the very same parquet file loads fine when all columns are read.

Any ideas why this could be the case?

Reproducible example

The problem is caused by a dataset I cannot share.

Expected behavior

Properly load the single column into a pl.DataFrame.

Installed versions

--------Version info---------
Polars:      0.17.12
Index type:  UInt32
Platform:    macOS-13.2.1-arm64-arm-64bit
Python:      3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:12:31) [Clang 14.0.6 ]

----Optional dependencies----
numpy:       1.23.5
pandas:      2.0.1
pyarrow:     9.0.0
connectorx:  0.3.0
deltalake:   <not installed>
fsspec:      2023.4.0
matplotlib:  3.7.1
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>

MariusMerkleQC May 06 '23 13:05

I think this is a duplicate of: #7237

ritchie46 May 06 '23 18:05

I'm not sure, I don't have an "infinite recursion error" in my case. What would the solution be if they are duplicates?

MariusMerkleQC May 08 '23 06:05

This doesn't run in Python or any other runtime that checks for recursion overflow. The recursion overflows the stack, and the process is then killed with a segfault.

ritchie46 May 08 '23 07:05

Can you share the file?

ritchie46 May 08 '23 07:05

Sorry, I cannot share the file due to data protection reasons. But the overall number of characters is ~7.5 billion (~133 million rows of ~56 characters each), so it gets quite close to the 10 GB limit.

MariusMerkleQC May 08 '23 07:05

Yes, that's the same problem. Could you check how many row groups and data pages are in the file? Was the file written with polars?

ritchie46 May 08 '23 08:05

The file was written with polars, yes.

How can I get the number of row groups and data pages?

MariusMerkleQC May 08 '23 08:05

Could you try writing the file with a row group size of 500k?

ritchie46 May 08 '23 08:05

Closing due to the lack of a reproducible example.

stinodego Jan 19 '24 22:01