
[C++][Parquet] Integer dictionary bitwidth preservation breaks multi-file read behaviour in pyarrow 20

cjrh opened this issue 6 months ago

Describe the bug, including details regarding any error messages, version, and platform.

This issue was first posed as a question in #30302. I repeat the text here for convenience.

Overview

I have a directory of parquet files. For a specific categorical column, some parquet files use int8 and some use int16 dictionary indices. In pyarrow 19.0.1, reading the directory as a dataset succeeds, but with pyarrow 20 it fails with the error below when loading data from the dataset directory.
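
For reference, a minimal sketch of how such a mixed directory can arise, assuming the files come from pandas categoricals (pandas picks int8 codes for up to 127 categories and int16 beyond that, and the written files record that index width in their schema); the paths and values here are made up:

    import os
    import pandas as pd

    os.makedirs("dataframes.parq", exist_ok=True)
    # Fewer than 128 categories -> int8 dictionary indices in the file
    small = pd.DataFrame({"ExpStartDate": pd.Categorical([f"d{i}" for i in range(100)])})
    # More than 127 categories -> int16 dictionary indices in the file
    large = pd.DataFrame({"ExpStartDate": pd.Categorical([f"d{i}" for i in range(1000)])})
    small.to_parquet("dataframes.parq/small.parquet", engine="pyarrow")
    large.to_parquet("dataframes.parq/large.parquet", engine="pyarrow")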

Reader code (Python)

Either

    import pandas as pd
    df = pd.read_parquet(
        path,
        engine="pyarrow",
    )

or

    import pyarrow.dataset as ds
    dataset = ds.dataset(path, format="parquet")
    table = dataset.to_table()
    df = table.to_pandas()

Traceback

File "/app/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1475, in read
  table = self._dataset.to_table(
          ^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3941, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 731 not in range: -128 to 127
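
(The value 731 here is presumably a dictionary index from one of the int16 files that does not fit into the int8 index type the unified dataset schema took from the first file inspected.)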

Parquet Metadata

This shows the dictionary info of the parquet files in the directory:

>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> dataset = ds.dataset(path)
>>> for f in dataset.files:
...     sch = pq.read_schema(f)
...     print(f, sch.field('ExpStartDate').type)
... 
dataframes.parq/00eac90ef2f504223a74498405e060a48.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0641c30f725cd448bafc335d36cd01f6b.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0cb2799478dd54c738efe76fdc1875326.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0cff477be69be4ee093d98728d4f84452.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0d103de6323904e93aecf24589c12a370.parquet dictionary<values=string, indices=int8, ordered=0>

Is my issue related to the change in #30302? Is there a way to restore the previous behaviour of upcasting to int32 on read, or what is the preferred workaround? It is going to be quite tedious to force all my writes to use int32, especially for migrating huge volumes of historical data. For now we remain on pyarrow 19.0.1, but at some point we would like to upgrade.
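
For context, the write-side fix would look something like this sketch (hypothetical table and path; assuming Arrow's dictionary-to-dictionary cast, which recent versions support):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Upcast the dictionary indices to int32 before every write, so that
    # all files in the directory agree on the index width.
    table = pa.Table.from_pandas(
        pd.DataFrame({"ExpStartDate": pd.Categorical(["2024-01-01", "2024-01-02"])})
    )
    i = table.schema.get_field_index("ExpStartDate")
    wide = table["ExpStartDate"].cast(pa.dictionary(pa.int32(), pa.string()))
    pq.write_table(table.set_column(i, "ExpStartDate", wide), "upcast.parquet")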

Component(s)

C++, Parquet, Python

cjrh (May 28 '25 14:05)

Normally, Dataset tries to normalize schemas when reading the files in a dataset. Apparently that doesn't work for dictionary types; we should fix this.

pitrou (May 29 '25 07:05)

Hi, I'm taking a look at this issue and had a few questions:

  1. What are some situations where schemas are normalized when reading the files? The only FragmentEvolutionStrategy I have found is BasicFragmentEvolution, which doesn't handle type promotions. Or are you talking about the call to UnifySchemas in DatasetFactory::Inspect?
  2. If so, it looks like the C++ API already supports this:
#include <arrow/dataset/api.h>
#include <arrow/filesystem/localfs.h>

using namespace arrow::dataset;

std::string path1 = "dataset/int8.parquet";  // value: dictionary<values=string, indices=int8, ordered=0>
std::string path2 = "dataset/int16.parquet"; // value: dictionary<values=string, indices=int16, ordered=0>

auto factory = FileSystemDatasetFactory::Make(
    std::make_shared<arrow::fs::LocalFileSystem>(), {path1, path2},
    std::make_shared<ParquetFileFormat>(), FileSystemFactoryOptions{}).ValueOrDie();

InspectOptions options;
options.fragments = InspectOptions::kInspectAllFragments;
options.field_merge_options = arrow::Field::MergeOptions::Permissive();
auto schema = factory->Inspect(options).ValueOrDie(); // value: dictionary<values=string, indices=int16, ordered=0>

auto dataset = factory->Finish(schema).ValueOrDie();
auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();
auto table = scanner->ToTable().ValueOrDie(); // value: dictionary<values=string, indices=int16, ordered=0>

It seems like doing it this way in Python is currently impossible because the FileSystemDatasetFactory.inspect method does not take an options argument.

hadrian-reppas (Jun 23 '25 20:06)

Or are you talking about the call to UnifySchemas in DatasetFactory::Inspect?

It should be this, indeed.

It seems like doing it this way in Python is currently impossible because the FileSystemDatasetFactory.inspect method does not take an options argument.

You're right, it should be exposed. Do you want to try and contribute this?

(however, it's a bit surprising that it would have worked in 19.0.1, then)

pitrou (Jun 30 '25 15:06)

I'll give it a shot. Instead of exposing InspectOptions, I think it would be better to just add two parameters to FileSystemDatasetFactory.inspect:

  • promote_options='default' which can also be set to 'permissive' (same as unify_schemas).
  • fragments=None which can also be set to a non-negative integer. C++ defaults to only checking one fragment for performance reasons, but the Python documentation says "Inspect all data fragments and return a common Schema." (See the sketch after this list.)
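
A rough sketch of the resulting signature (hypothetical; the merged API may differ):

    def inspect(self, promote_options="default", fragments=None):
        """Inspect data fragments and return a common Schema.

        promote_options: 'default' or 'permissive', forwarded to schema
            unification (same meaning as in unify_schemas).
        fragments: None to inspect all fragments, or a non-negative int
            limiting how many fragments are checked (C++ checks only one
            by default, for performance).
        """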

hadrian-reppas (Jun 30 '25 16:06)

Hi @hadrian-reppas, thanks for taking this on. It would be nice to include this in the Arrow 21 release, which we're working on finalizing now. Do you think you can get a PR in for this soon?

amoeba (Jun 30 '25 18:06)

I'm working on it now. I should be done soon.

hadrian-reppas (Jun 30 '25 18:06)

This commit allows me to write

import pyarrow.dataset as ds
import pyarrow.fs as fs

factory = ds.FileSystemDatasetFactory(fs.LocalFileSystem(), [path1, path2], ds.ParquetFileFormat())
schema = factory.inspect("permissive")
dataset = factory.finish(schema)
table = dataset.to_table()

(just like the C++ example above) and get the desired behavior. It also gave correct results for the other parameter values when I tested it in the REPL. Two issues:

  • I can't figure out how to access the static constexpr kInspectAllFragments in InspectOptions, so I just hardcoded -1.
  • This branches off of maint-20.0.0 because when I branch off main I get the following error trying to import pyarrow.dataset:
>>> import pyarrow.dataset as ds
<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow.lib.IpcReadOptions size changed, may indicate binary incompatibility. Expected 104 from C header, got 112 from PyObject
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hadrian/dev/arrow/python/pyarrow/dataset.py", line 85, in <module>
    from pyarrow._dataset_parquet import (  # noqa
  File "pyarrow/_dataset_parquet.pyx", line 57, in init pyarrow._dataset_parquet
    from pyarrow._dataset_parquet_encryption import (
  File "pyarrow/_dataset_parquet_encryption.pyx", line 1, in init pyarrow._dataset_parquet_encryption
    # Licensed to the Apache Software Foundation (ASF) under one
AttributeError: module 'pyarrow.lib' has no attribute 'PyExtensionType'. Did you mean: 'ExtensionType'?

hadrian-reppas (Jun 30 '25 19:06)

I'm having trouble writing tests because of the import issue, but it's a pretty small change and it worked when I tested it in the REPL. I figured I would put up a PR with what I have.

hadrian-reppas (Jul 01 '25 15:07)

@hadrian-reppas I see, thank you. Perhaps someone can help you here? cc @AlenkaF @thisisnic

pitrou (Jul 01 '25 15:07)

I will have a look later today to see if I can reproduce the import failure; it seems to be connected to https://github.com/apache/arrow/pull/46199.

AlenkaF (Jul 02 '25 04:07)

@AlenkaF @pitrou is this a blocker for 21.0.0? We should add the blocker label if it is.

raulcd (Jul 09 '25 09:07)

I don't think so. It's a regression, but unfortunately it was already present in 20.0.0. The fix can go into a hypothetical 21.0.1 if it's merged by then.

pitrou (Jul 09 '25 09:07)

It's a regression, but there is a workaround (use an explicit schema), so treating this as not-a-blocker makes sense to me.
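
For illustration, the explicit-schema workaround might look like this sketch ("dataframes.parq" and the field name are taken from the report above; the dataset's remaining fields would have to be listed as well):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Pin the dictionary column to the widest index type used by any file.
    schema = pa.schema([
        ("ExpStartDate", pa.dictionary(pa.int16(), pa.string())),
        # ... plus the dataset's other fields ...
    ])
    table = ds.dataset("dataframes.parq", format="parquet", schema=schema).to_table()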

amoeba (Jul 09 '25 14:07)

OK, I've removed the 21.0.0 milestone. If a solution is merged before an RC for 21.0.0 happens, we could cherry-pick it, but it is not a blocker, as discussed.

raulcd (Jul 09 '25 14:07)

Issue resolved by pull request #46961: https://github.com/apache/arrow/pull/46961

pitrou (Jul 15 '25 12:07)

Thanks @hadrian-reppas and @pitrou for fixing this ❤️

For interest's sake, if anyone else using pd.read_parquet (with the pyarrow engine) happens to find this issue before pandas adds support for the new "permissive" option, the wrapper below appears to work for me:

import pandas as pd
import pyarrow.dataset as ds
import pyarrow.fs as fs

def read_parquet(dataset_dir, *args, **kwargs):
    # Unify the file schemas permissively, then pass the resulting
    # schema to pandas explicitly.
    selector = fs.FileSelector(str(dataset_dir), recursive=True)
    factory = ds.FileSystemDatasetFactory(
        fs.LocalFileSystem(), selector, ds.ParquetFileFormat()
    )
    schema = factory.inspect(promote_options="permissive")
    return pd.read_parquet(str(dataset_dir), *args, schema=schema, **kwargs)
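
(Note that this wrapper hardcodes fs.LocalFileSystem(), so it only covers local directories; a remote filesystem would need different factory arguments.)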

cjrh (Oct 26 '25 18:10)