datafusion Allow providing Arrow schema when scanning Parquet files

Is your feature request related to a problem or challenge?

When scanning Parquet files, we'd often like to provide an expected schema, since:

The Parquet files might not all have an identical physical schema, but we may know the unified schema up front, such as when we are using a table format like Delta Lake.
A given Parquet type can map to many different Arrow types. For example, when possible it would be nice to read a string column as a DictionaryArray, especially if the data is already dictionary-encoded.

Describe the solution you'd like

It would be nice to be able to provide an Arrow schema in ParquetScanOptions, and then the scan would try in order:

Read the Parquet data directly into the output type
Read the Parquet data into a supported Arrow type then cast
Return an error stating which types can't be mapped

This probably would require some changes upstream in the parquet crate, but for at least some of the functionality datafusion seems like the right place. LMK if you think differently.

Describe alternatives you've considered

Right now our current issue is that our expected schema (from the Delta Lake log) doesn't match the physical schema (at least when written by Spark). So as a workaround we are looking at the metadata of one of the Parquet files. This will partially solve the issue, but will likely fail for tables where Spark wasn't the only engine to write to the table.

Additional context

Here's a simple example where PyArrow's scanner works but Datafusions doesn't seem to:

import pyarrow as pa
import pyarrow.parquet as pq

tab1 = pa.table({"a": pa.array([1, 2, 3, 4, 5], type=pa.int32())})
tab2 = pa.table({"a": pa.array([6, 7, 8, 9, 10], type=pa.int64())})

file1 = 'table/1.parquet'
file2 = 'table/2.parquet'

pq.write_table(tab1, file1)
pq.write_table(tab2, file2)

import pyarrow.dataset as ds

ds.dataset("table", format="parquet").to_table()

pyarrow.Table
a: int32
----
a: [[1,2,3,4,5],[6,7,8,9,10]]

from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('numbers', 'table')

Exception: DataFusion error: ArrowError(SchemaError("Fail to merge schema field 'a' because the from data_type = Int64 does not equal Int32"))

Apr 11 '23 01:04 wjones127

This should be fixed now by #10515. You can now override the schema used in the file scanner using the SchemaAdapter.

Jun 24 '24 01:06 HawaiianSpork

This should be fixed now by https://github.com/apache/datafusion/pull/10515. You can now override the schema used in the file scanner using the SchemaAdapter.

Doesn't the SchemaAdapter convert the schema that was already read? So it doesn't really solve the issue.

Does passing in a schema to FileScanConfig not work? Or is this request specifically for a Python API?

Mar 14 '25 05:03 adriangb

This should be fixed now by https://github.com/apache/datafusion/pull/10515. You can now override the schema used in the file scanner using the SchemaAdapter.

Doesn't the SchemaAdapter convert the schema that was already read? So it doesn't really solve the issue.

Does passing in a schema to FileScanConfig not work? Or is this request specifically for a Python API?

The file_schema in FileScanConfig can be used to coarse the schema read from parquet into the supplied schema using arrow cast. If, however, you need functionality beyond cast (for example to add columns that don't exist in some of the parquet files) than schemaAdapter can be used to convert the data returned before it is used by datafusion. This allows the extension of the parquet table provider. Otherwise, a new table provider would need to be created.

Mar 14 '25 12:03 HawaiianSpork