iceberg-python

Pyarrow data type, default to small type and fix large type override

Open kevinjqliu opened this issue 9 months ago • 0 comments

Rationale for this change

#1669 changed reads to infer the Arrow type instead of defaulting pyarrow data types to the large type. Defaulting to the large type was originally introduced by #986.

I found a bug in #1669 where type promotion from string->binary defaults to large_binary (https://github.com/apache/iceberg-python/pull/1669#discussion_r2017223767). That led me to find that we still use the large type in _ConvertToArrowSchema. Furthermore, I found that we did not respect PYARROW_USE_LARGE_TYPES_ON_READ=True when reading.

This PR is a continuation of #1669.

  • Change docs for pyarrow.use-large-types-on-read to default value False
  • Change _ConvertToArrowSchema to use small data type instead of large
  • When PYARROW_USE_LARGE_TYPES_ON_READ is enabled (set to True), ArrowScan and ArrowProjectionVisitor should cast to large type
  • Add back test for setting PYARROW_USE_LARGE_TYPES_ON_READ to True

This PR should help us infer the data type when reading while keeping the PYARROW_USE_LARGE_TYPES_ON_READ override behavior until deprecation.

Are these changes tested?

Yes

Are there any user-facing changes?

No

kevinjqliu, Mar 27 '25 21:03