[C++][Parquet] Integer dictionary bitwidth preservation breaks multi-file read behaviour in pyarrow 20
Describe the bug, including details regarding any error messages, version, and platform.
This issue was first posed as a question in #30302. I repeat the text here for convenience.
Overview
I have a directory of parquet files. For a specific categorical column, some parquet files use int8 and some use int16. In pyarrow 19.0.1, reading the directory as a dataset succeeds. But with pyarrow 20, it fails with the below error when loading data from the dataset directory
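Writer sketch (python)
For context only, a rough, hypothetical sketch of how a directory like this can come about. It assumes pandas picks int8 categorical codes for small category counts and int16 for larger ones, and that pyarrow keeps that index width in the stored Arrow schema; the file names and values are made up, ExpStartDate is the real column name.
# Hypothetical reproduction (made-up values and file names; the real pipeline
# just writes pandas categoricals of varying cardinality over time).
from pathlib import Path
import pandas as pd

Path("dataframes.parq").mkdir(exist_ok=True)

# ~10 categories -> int8 codes in pandas -> dictionary<indices=int8> in the file
small = pd.DataFrame(
    {"ExpStartDate": pd.Categorical([f"2024-01-{i:02d}" for i in range(1, 11)])}
)
small.to_parquet("dataframes.parq/small.parquet", engine="pyarrow")

# ~800 categories -> int16 codes in pandas -> dictionary<indices=int16> in the file
big = pd.DataFrame(
    {"ExpStartDate": pd.Categorical([f"day-{i:04d}" for i in range(800)])}
)
big.to_parquet("dataframes.parq/big.parquet", engine="pyarrow")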
Reader code (python)
Either
import pandas as pd
df = pd.read_parquet(
path,
engine="pyarrow",
)
or
import pyarrow.dataset as dataset
dataset = dataset.dataset(path, format="parquet")
table = dataset.to_table()
df = table.to_pandas()
Traceback
File "/app/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1475, in read
table = self._dataset.to_table(
^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3941, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 731 not in range: -128 to 127
Parquet Metadata
This shows the dictionary info of the parquet files in the directory:
>>> import pyarrow.dataset as dataset
>>> import pyarrow.parquet as pq
>>> ds = dataset.dataset(path)
>>> for path in ds.files:
...     sch = pq.read_schema(path)
...     print(path, sch.field('ExpStartDate').type)
...
dataframes.parq/00eac90ef2f504223a74498405e060a48.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0641c30f725cd448bafc335d36cd01f6b.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0cb2799478dd54c738efe76fdc1875326.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0cff477be69be4ee093d98728d4f84452.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0d103de6323904e93aecf24589c12a370.parquet dictionary<values=string, indices=int8, ordered=0>
Is my issue related to the change in #30302? Is there a way to restore the previous behaviour of upcasting to int32 on read, or what is the preferred workaround? It would be quite tedious to have to force all my writes to use int32, especially for migrating huge volumes of historical data. For now we remain on pyarrow 19.0.1, but at some point we would like to upgrade.
Component(s)
C++, Parquet, Python
Normally, dataset tries to normalize schemas when reading the files in a dataset. Apparently that doesn't work for dictionary types; we should fix this.
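(For reference, schema unification with permissive promotion appears to handle this widening already. A small sketch, assuming pyarrow.unify_schemas with promote_options='permissive' follows the same Field merge logic as the C++ example further down:)
import pyarrow as pa

s8 = pa.schema([("ExpStartDate", pa.dictionary(pa.int8(), pa.string()))])
s16 = pa.schema([("ExpStartDate", pa.dictionary(pa.int16(), pa.string()))])

# With permissive promotion the wider index type should win: dictionary<int16, string>
unified = pa.unify_schemas([s8, s16], promote_options="permissive")
print(unified.field("ExpStartDate").type)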
Hi, I'm taking a look at this issue and had a few questions:
- What are some situations where schemas are normalized when reading the files? The only FragmentEvolutionStrategy I have found is BasicFragmentEvolution, which doesn't handle type promotions. Or are you talking about the call to UnifySchemas in DatasetFactory::Inspect?
- If so, it looks like the C++ API already supports this:
std::string path1 = "dataset/int8.parquet"; // value: dictionary<values=string, indices=int8, ordered=0>
std::string path2 = "dataset/int16.parquet"; // value: dictionary<values=string, indices=int16, ordered=0>
auto factory = FileSystemDatasetFactory::Make(
std::make_shared<arrow::fs::LocalFileSystem>(), {path1, path2},
std::make_shared<ParquetFileFormat>(), FileSystemFactoryOptions{}).ValueOrDie();
InspectOptions options;
options.fragments = InspectOptions::kInspectAllFragments;
options.field_merge_options = Field::MergeOptions::Permissive();
auto schema = factory->Inspect(options).ValueOrDie(); // value: dictionary<values=string, indices=int16, ordered=0>
auto dataset = factory->Finish(schema).ValueOrDie();
auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();
auto table = scanner->ToTable().ValueOrDie(); // value: dictionary<values=string, indices=int16, ordered=0>
It seems like doing it this way in Python is currently impossible because the FileSystemDatasetFactory.inspect method does not take an options argument.
Or are you talking about the call to UnifySchemas in DatasetFactory::Inspect?
It should be this, indeed.
It seems like doing it this way in Python is currently impossible because the FileSystemDatasetFactory.inspect method does not take an options argument.
You're right, it should be exposed. Do you want to try and contribute this?
(however, it's a bit surprising that it would have worked in 19.0.1, then)
I'll give it a shot. Instead of exposing InspectOptions, I think it could be better to just add two parameters to FileSystemDatasetFactory.inspect:
- promote_options='default', which can also be set to 'permissive' (same as unify_schemas).
- fragments=None, which can also be set to a non-negative integer. C++ defaults to only checking one fragment for performance reasons, but the Python documentation says "Inspect all data fragments and return a common Schema." (Rough sketch of the resulting signature below.)
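Roughly, I'm imagining a signature like this (just a sketch of the proposal above, not the final API):
# Sketch of the proposed FileSystemDatasetFactory.inspect signature
# (parameter names/defaults as proposed above; not necessarily what gets merged).
def inspect(self, promote_options="default", fragments=None):
    """Inspect data fragments and return a common Schema.

    promote_options: 'default' or 'permissive', forwarded to schema
        unification (same meaning as in pyarrow.unify_schemas).
    fragments: None to inspect all fragments, or a non-negative integer
        limiting how many fragments are inspected.
    """
    ...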
Hi @hadrian-reppas, thanks for taking this on. It would be nice to include this in the Arrow 21 release which we're working on finalizing now. Do you think you can get a PR in for this soon?
I'm working on it now. I should be done soon.
This commit allows me to write
factory = ds.FileSystemDatasetFactory(fs.LocalFileSystem(), [path1, path2], ds.ParquetFileFormat())
schema = factory.inspect("permissive")
dataset = factory.finish(schema)
table = dataset.to_table()
(just like C++ example above) and get the desired behavior. It also gives correct results for other parameters when I tested in the REPL. Two issues:
- I can't figure out how to access the static constexpr kInspectAllFragments in InspectOptions, so I just hardcoded -1.
- This branches off of maint-20.0.0 because when I branch off main I get the following error trying to import pyarrow.dataset:
>>> import pyarrow.dataset as ds
<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow.lib.IpcReadOptions size changed, may indicate binary incompatibility. Expected 104 from C header, got 112 from PyObject
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/hadrian/dev/arrow/python/pyarrow/dataset.py", line 85, in <module>
from pyarrow._dataset_parquet import ( # noqa
File "pyarrow/_dataset_parquet.pyx", line 57, in init pyarrow._dataset_parquet
from pyarrow._dataset_parquet_encryption import (
File "pyarrow/_dataset_parquet_encryption.pyx", line 1, in init pyarrow._dataset_parquet_encryption
# Licensed to the Apache Software Foundation (ASF) under one
AttributeError: module 'pyarrow.lib' has no attribute 'PyExtensionType'. Did you mean: 'ExtensionType'?
I'm having trouble writing tests because of the import issue but it's a pretty small change and it worked when I tested it in the REPL. I figured I would put up a PR with what I have.
@hadrian-reppas I see, thank you. Perhaps someone can help you here? cc @AlenkaF @thisisnic
I will have a look if I can reproduce the import failure later today, seems it is connected to https://github.com/apache/arrow/pull/46199.
@AlenkaF @pitrou is this a blocker for 21.0.0? We should add the blocker label if it is.
I don't think so. It's a regression but unfortunately it was already in 20.0.0. The fix can go in a hypothetical 21.0.1 if it's merged by then.
It's a regression but there is a workaround (use an explicit schema) so treating this as not-a-blocker makes sense to me.
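For anyone stuck on pyarrow 20.x before the fix lands, a sketch of that explicit-schema workaround (column name taken from the report above; the key point is declaring the dictionary column with a wide enough index type such as int32 and passing that schema to the dataset):
import pyarrow as pa
import pyarrow.dataset as ds

# Start from the inferred schema (taken from the first file) and widen the
# dictionary column's index type so codes from every file fit.
inferred = ds.dataset("dataframes.parq", format="parquet").schema
i = inferred.get_field_index("ExpStartDate")
widened = inferred.set(
    i, pa.field("ExpStartDate", pa.dictionary(pa.int32(), pa.string()))
)

table = ds.dataset("dataframes.parq", format="parquet", schema=widened).to_table()
df = table.to_pandas()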
ok, I've removed the 21.0.0 milestone. If a solution is merged before a RC for 21.0.0 happens we could cherry-pick it but it is not a blocker as discussed.
Issue resolved by pull request 46961 https://github.com/apache/arrow/pull/46961
Thanks @hadrian-reppas and @pitrou for fixing this ❤️
For interest's sake, if anyone else using pd.read_parquet (with the pyarrow engine) happens to find this issue before pandas adds support for the new "permissive" option, the wrapper below appears to work for me:
import pandas as pd
import pyarrow.dataset as ds
import pyarrow.fs as fs

def read_parquet(dataset_dir, *args, **kwargs):
    selector = fs.FileSelector(str(dataset_dir), recursive=True)
    factory = ds.FileSystemDatasetFactory(
        fs.LocalFileSystem(), selector, ds.ParquetFileFormat()
    )
    schema = factory.inspect(promote_options="permissive")
    return pd.read_parquet(str(dataset_dir), *args, schema=schema, **kwargs)
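Usage is then a drop-in replacement for pd.read_parquet on the directory path, e.g. df = read_parquet("dataframes.parq").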