
Panic on `read_parquet` related to `memory_map` (Operation timed out)

Open dkreeft opened this issue 3 years ago • 5 comments

What language are you using?

Python

Have you tried latest version of polars?

Yes

What version of polars are you using?

0.13.59

What operating system are you using polars on?

MacOS 12.5

What language version are you using

Python 3.9.7 (default, Sep 3 2021, 12:37:55)

Describe your bug.

When using read_parquet on some parquet files, the panic below occurs.

What are the steps to reproduce the behavior?

Use read_parquet on one of several parquet files generated by Azure (see the bottom of this issue for the parquet-tools inspect output) as follows:

Example

import polars as pl
pl.read_parquet(<parquet file>)

What is the actual behavior?

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 60, kind: TimedOut, message: "Operation timed out" }', /Users/runner/work/polars/polars/polars/polars-io/src/mmap.rs:71:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<path to venv>/venv/lib/python3.9/site-packages/polars/io.py", line 912, in read_parquet
    return DataFrame._read_parquet(
  File "<path to venv>/venv/lib/python3.9/site-packages/polars/internals/frame.py", line 712, in _read_parquet
    self._df = PyDataFrame.read_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 60, kind: TimedOut, message: "Operation timed out" }

What is the expected behavior?

Successfully read a parquet file.

More information:

  • scan_parquet and read_parquet_schema work on the file, so file seems to be valid
  • pyarrow (standalone) is able to read the file
  • When using read_parquet with use_pyarrow=True and memory_map=False, the file is read successfully. This seems to imply the issue is in the memory-map feature
  • memory_map is documented as only being used with use_pyarrow=True, but that does not seem to be correct: irrespective of these flags, memory mapping appears to be used unless memory_map=False is passed explicitly
  • other parquet files work, such as this one
  • output of parquet-tools inspect on the parquet file causing issues:
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 9
num_rows: 206370
num_row_groups: 3
format_version: 1.0
serialized_size: 4048


############ Columns ############
Name
Creation-Time
Content-Length
Content-Type
hdi_isfolder
Owner
Group
Permissions
Acl

############ Column(Name) ############
name: Name
path: Name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 73%)

############ Column(Creation-Time) ############
name: Creation-Time
path: Creation-Time
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 3%)

############ Column(Content-Length) ############
name: Content-Length
path: Content-Length
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(Content-Type) ############
name: Content-Type
path: Content-Type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -3%)

############ Column(hdi_isfolder) ############
name: hdi_isfolder
path: hdi_isfolder
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 93%)

############ Column(Owner) ############
name: Owner
path: Owner
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 0%)

############ Column(Group) ############
name: Group
path: Group
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 3%)

############ Column(Permissions) ############
name: Permissions
path: Permissions
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 93%)

############ Column(Acl) ############
name: Acl
path: Acl
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 92%)

dkreeft avatar Aug 03 '22 12:08 dkreeft

Thanks for the report. Are you certain we memory map if pyarrow=False?

https://github.com/pola-rs/polars/blob/ac916d2dadc132cb78268ca8ab17210d1bec3777/py-polars/polars/io.py#L812

We use pyarrow for memory mapping and AFAICT we only take that branch if use_pyarrow=True

ritchie46 avatar Aug 03 '22 12:08 ritchie46

Thanks for the fast reply. It seems read_parquet is not doing much with memory_map now that I look at it (you have linked to read_ipc). To be clear:

| use_pyarrow | memory_map | result  |
|-------------|------------|---------|
| True        | True       | error   |
| False       | True       | error   |
| True        | False      | success |
| False       | False      | error   |
| True        | None       | error   |
| False       | None       | error   |

(the last two rows are explained by memory_map defaulting to True)

dkreeft avatar Aug 03 '22 12:08 dkreeft

Oh sorry, I thought we were talking about IPC. For parquet we simply memory map the whole file and then copy it into memory. We do the same with the csv reader. Does that give you any problems as well?
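
The mmap-then-copy strategy described above can be illustrated with a standard-library sketch (this is not the actual polars internals, which are in Rust under polars-io/src/mmap.rs; it only shows the general technique):

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str) -> bytes:
    """Memory-map a file, then copy its contents into an in-memory buffer."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Pages are faulted in lazily, so I/O errors from a slow or flaky
        # filesystem can surface during this copy rather than at open().
        return bytes(mm)


# Small demo file so the function can be exercised.
demo = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo, "wb") as f:
    f.write(b"hello parquet")

data = read_via_mmap(demo)
```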

ritchie46 avatar Aug 03 '22 15:08 ritchie46

@dkreeft could you share your file?

ritchie46 avatar Aug 03 '22 19:08 ritchie46

Unfortunately I cannot share the file, as it is a report of our Azure storage account and contains confidential information. I tried to anonymize it, but when I read the file with pyarrow and write it back using the parquet.write_table method, the newly written parquet file is readable with polars. Is there another way to anonymize the data or delete the rows without changing the metadata of the file? Or can I somehow generate example files from the schema? The difference between the two files when running parquet-tools inspect is (besides differences in compression percentages):

new file

############ file meta data ############
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 9
num_rows: 206370
num_row_groups: 1
format_version: 1.0
serialized_size: 2067

original file

############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 9
num_rows: 206370
num_row_groups: 3
format_version: 1.0
serialized_size: 4048

Note the difference in num_row_groups and serialized_size. Perhaps those differences lead to the shown behavior?

edit: using polars to read the parquet file and writing it back right away leads to the following output from parquet-tools inspect:

############ file meta data ############
created_by: Arrow2 - Native Rust implementation of Arrow
num_columns: 9
num_rows: 206370
num_row_groups: 1
format_version: 2.6
serialized_size: 1673


############ Columns ############
Name
Creation-Time
Content-Length
Content-Type
hdi_isfolder
Owner
Group
Permissions
Acl

############ Column(Name) ############
name: Name
path: Name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 76%)

############ Column(Creation-Time) ############
name: Creation-Time
path: Creation-Time
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: LZ4 (space_saved: 62%)

############ Column(Content-Length) ############
name: Content-Length
path: Content-Length
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: LZ4 (space_saved: 98%)

############ Column(Content-Type) ############
name: Content-Type
path: Content-Type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)

############ Column(hdi_isfolder) ############
name: hdi_isfolder
path: hdi_isfolder
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: LZ4 (space_saved: 10%)

############ Column(Owner) ############
name: Owner
path: Owner
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)

############ Column(Group) ############
name: Group
path: Group
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)

############ Column(Permissions) ############
name: Permissions
path: Permissions
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)

############ Column(Acl) ############
name: Acl
path: Acl
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: LZ4 (space_saved: 99%)

This file can also be read without problems using polars.

dkreeft avatar Aug 04 '22 08:08 dkreeft

I will close this as we cannot do anything without being able to reproduce this. Feel free to open a new issue when you have something we can reproduce.

ritchie46 avatar Oct 21 '22 18:10 ritchie46