polars
Panic on `read_parquet` related to `memory_map` (Operation timed out)
What language are you using?
Python
Have you tried latest version of polars?
Yes
What version of polars are you using?
0.13.59
What operating system are you using polars on?
MacOS 12.5
What language version are you using
Python 3.9.7 (default, Sep 3 2021, 12:37:55)
Describe your bug.
When using read_parquet on some parquet files, the panic below occurs.
What are the steps to reproduce the behavior?
When using read_parquet on several parquet files generated by Azure (see bottom for parquet-tools inspect) as follows:
Example
import polars as pl
pl.read_parquet("<parquet file>")  # path to one of the affected files
What is the actual behavior?
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 60, kind: TimedOut, message: "Operation timed out" }', /Users/runner/work/polars/polars/polars/polars-io/src/mmap.rs:71:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<path to venv>/venv/lib/python3.9/site-packages/polars/io.py", line 912, in read_parquet
    return DataFrame._read_parquet(
  File "<path to venv>/venv/lib/python3.9/site-packages/polars/internals/frame.py", line 712, in _read_parquet
    self._df = PyDataFrame.read_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 60, kind: TimedOut, message: "Operation timed out" }
What is the expected behavior?
Successfully read a parquet file.
More information:
- `scan_parquet` and `read_parquet_schema` work on the file, so the file seems to be valid
- `pyarrow` (standalone) is able to read the file
- When using `read_parquet` with `use_pyarrow=True` and `memory_map=False`, the file is read successfully. This seems to imply the issue is in the memory map feature
- `memory_map` is described as only being used with `use_pyarrow=True`, but that does not seem to be correct: irrespective of these flags, `memory_map` seems to be used unless `memory_map=False`
- other parquet files work, such as this one
- output of `parquet-tools inspect` on the parquet file causing issues:
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 9
num_rows: 206370
num_row_groups: 3
format_version: 1.0
serialized_size: 4048
############ Columns ############
Name
Creation-Time
Content-Length
Content-Type
hdi_isfolder
Owner
Group
Permissions
Acl
############ Column(Name) ############
name: Name
path: Name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 73%)
############ Column(Creation-Time) ############
name: Creation-Time
path: Creation-Time
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 3%)
############ Column(Content-Length) ############
name: Content-Length
path: Content-Length
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)
############ Column(Content-Type) ############
name: Content-Type
path: Content-Type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -3%)
############ Column(hdi_isfolder) ############
name: hdi_isfolder
path: hdi_isfolder
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 93%)
############ Column(Owner) ############
name: Owner
path: Owner
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 0%)
############ Column(Group) ############
name: Group
path: Group
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 3%)
############ Column(Permissions) ############
name: Permissions
path: Permissions
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 93%)
############ Column(Acl) ############
name: Acl
path: Acl
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 92%)
Thanks for the report. Are you certain we memory map if use_pyarrow=False?
https://github.com/pola-rs/polars/blob/ac916d2dadc132cb78268ca8ab17210d1bec3777/py-polars/polars/io.py#L812
We use pyarrow for memory mapping and AFAICT we only take that branch if use_pyarrow=True
Thanks for the fast reply. It seems read_parquet is not doing much with memory_map now that I look at it (you have linked to read_ipc). To be clear:
| use_pyarrow | memory_map | result |
|---|---|---|
| True | True | error |
| False | True | error |
| True | False | success |
| False | False | error |
| True | None | error |
| False | None | error |
(last 2 entries are explained by memory_map being True by default)
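The table above could be regenerated with a small helper along these lines. This is a sketch, not code from the thread: `try_combinations` and its `reader` parameter are hypothetical names, and it assumes `read_parquet` accepts the `use_pyarrow` and `memory_map` keywords as used in the report. It catches `BaseException` on purpose, because pyo3's `PanicException` does not derive from `Exception`.

```python
def try_combinations(path, reader=None):
    """Call the reader once per (use_pyarrow, memory_map) pair and record the outcome.

    `reader` is a hypothetical injection point; by default it is pl.read_parquet.
    """
    if reader is None:
        import polars as pl  # imported lazily so the helper also works with a stand-in reader
        reader = pl.read_parquet

    results = []
    # Same row order as the table: memory_map varies slowest, use_pyarrow fastest.
    for memory_map in (True, False, None):
        for use_pyarrow in (True, False):
            kwargs = {"use_pyarrow": use_pyarrow}
            if memory_map is not None:
                kwargs["memory_map"] = memory_map  # None means: rely on the default
            try:
                reader(path, **kwargs)
                outcome = "success"
            except (KeyboardInterrupt, SystemExit):
                raise  # never swallow an interrupt or interpreter exit
            except BaseException:
                # A Rust panic surfaces as pyo3's PanicException, a BaseException subclass.
                outcome = "error"
            results.append((use_pyarrow, memory_map, outcome))
    return results


def format_table(results):
    """Render the results in the same pipe-table layout as the thread."""
    lines = ["| use_pyarrow | memory_map | result |", "|---|---|---|"]
    for use_pyarrow, memory_map, outcome in results:
        lines.append(f"| {use_pyarrow} | {memory_map} | {outcome} |")
    return "\n".join(lines)
```

Calling `print(format_table(try_combinations("<parquet file>")))` on an affected file would then produce a table in the same shape as the one reported here.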
Oh sorry, I thought we were talking about IPC. For parquet we simply memory map the whole file and then copy it into memory. We do the same with the csv reader. Does that give you any problems as well?
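To illustrate what "memory map the whole file and then copy it into memory" means, here is a rough Python analogue using only the standard library. It is not the polars implementation (which is in Rust, in `polars-io/src/mmap.rs` according to the panic message); the function names are made up for this sketch. Running both helpers on an affected file might help narrow down whether mapping the file fails on its own, although nothing in this thread confirms that.

```python
import mmap


def read_via_mmap(path):
    """Map the whole file read-only, then copy the mapped bytes into memory."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; this raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:]  # slicing copies the mapping into a bytes object


def read_via_plain_io(path):
    """Read the whole file with ordinary buffered I/O, without any mapping."""
    with open(path, "rb") as f:
        return f.read()
```

If `read_via_mmap` raised an `OSError` for the affected file while `read_via_plain_io` succeeded, that would point at the mapping step rather than at parquet decoding.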
@dkreeft could you share your file?
Unfortunately I cannot share the file as it is a report of our Azure storage account, which contains confidential information. I tried to anonymize it, but when I read the file using pyarrow and use the parquet.write_table method, the newly written parquet is readable with polars. Is there another way to anonymize the data/delete the rows without changing the metadata of the file? Or can I somehow generate example files using the schema? The difference between both files when running parquet-tools inspect is (besides differences in compression percentages):
new file
############ file meta data ############
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 9
num_rows: 206370
num_row_groups: 1
format_version: 1.0
serialized_size: 2067
original file
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 9
num_rows: 206370
num_row_groups: 3
format_version: 1.0
serialized_size: 4048
Note the difference in num_row_groups and serialized_size. Perhaps those differences lead to the shown behavior?
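The comparison above can also be done mechanically. The sketch below parses the two "file meta data" blocks quoted in this comment and lists the keys whose values differ. `parse_file_metadata` and `diff_metadata` are names invented for this example, and the parser only assumes the simple `key: value` layout shown here.

```python
ORIGINAL = """\
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 9
num_rows: 206370
num_row_groups: 3
format_version: 1.0
serialized_size: 4048
"""

NEW = """\
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 9
num_rows: 206370
num_row_groups: 1
format_version: 1.0
serialized_size: 2067
"""


def parse_file_metadata(text):
    """Turn 'key: value' lines into a dict, skipping '####' section headers."""
    meta = {}
    for line in text.splitlines():
        if line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def diff_metadata(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


for key, (old, new) in diff_metadata(
    parse_file_metadata(ORIGINAL), parse_file_metadata(NEW)
).items():
    print(f"{key}: {old} -> {new}")
```

For the two blocks above this reports `created_by`, `num_row_groups` and `serialized_size`, so the writer version differs in addition to the two fields already noted.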
Edit: reading the parquet file with polars and writing it back right away leads to the following output for parquet-tools inspect:
############ file meta data ############
created_by: Arrow2 - Native Rust implementation of Arrow
num_columns: 9
num_rows: 206370
num_row_groups: 1
format_version: 2.6
serialized_size: 1673
############ Columns ############
Name
Creation-Time
Content-Length
Content-Type
hdi_isfolder
Owner
Group
Permissions
Acl
############ Column(Name) ############
name: Name
path: Name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 76%)
############ Column(Creation-Time) ############
name: Creation-Time
path: Creation-Time
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: LZ4 (space_saved: 62%)
############ Column(Content-Length) ############
name: Content-Length
path: Content-Length
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: LZ4 (space_saved: 98%)
############ Column(Content-Type) ############
name: Content-Type
path: Content-Type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)
############ Column(hdi_isfolder) ############
name: hdi_isfolder
path: hdi_isfolder
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: LZ4 (space_saved: 10%)
############ Column(Owner) ############
name: Owner
path: Owner
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)
############ Column(Group) ############
name: Group
path: Group
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)
############ Column(Permissions) ############
name: Permissions
path: Permissions
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)
############ Column(Acl) ############
name: Acl
path: Acl
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)
This file can also be read without problems using polars.
I will close this as we cannot do anything without being able to reproduce this. Feel free to open a new issue when you have something we can reproduce.