fix(python): ensure pyarrow.compute module is loaded

Open josh opened this issue 2 years ago • 2 comments

Stumbled across a pyarrow lazy-loading race condition where pa.compute functions may not be available yet. It's difficult to cover in the test suite, since another test may already have triggered the module to be fully loaded, hiding the bug.

I believe the pyarrow docs recommend importing and using the compute module directly rather than relying on it being loaded on the root package. This change adds an explicit lazy-load dependency for the pyarrow.compute module.
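
The underlying Python behaviour can be reproduced without pyarrow. The sketch below is only an analogy: it builds a throwaway package named demo_pkg (a hypothetical stand-in, not pyarrow's real layout) whose root __init__.py does not import its compute submodule, and shows that the attribute only appears on the root package once the submodule has been imported explicitly.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_pkg/__init__.py and demo_pkg/compute.py.
# "demo_pkg" is a hypothetical stand-in for pyarrow, not pyarrow itself.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # the root package does not import .compute
(pkg_dir / "compute.py").write_text("def cast(value, to):\n    return to(value)\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg

# Importing the root package alone does not bind the submodule as an attribute.
available_before = hasattr(demo_pkg, "compute")

import demo_pkg.compute  # the explicit submodule import binds demo_pkg.compute

available_after = hasattr(demo_pkg, "compute")
result = demo_pkg.compute.cast("3", int)
```

If any other code in the same process has already imported the submodule, the first check would also succeed, which matches the observation that another test can hide the bug.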

Reproduction Steps

import pyarrow as pa
import pyarrow.feather as feather

col = pa.chunked_array([["foo"], ["bar"]], type=pa.dictionary(pa.int8(), pa.string()))
table = pa.table([col], names=["a"])
feather.write_feather(table, "example.ipc")

import polars as pl
# import pyarrow.compute  # enable workaround

pl.read_ipc("example.ipc", use_pyarrow=True)

Traceback (most recent call last):
  File "example.py", line 5, in <module>
    pl.read_ipc("example.ipc", use_pyarrow=True)
  File "polars/utils.py", line 394, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "polars/io.py", line 860, in read_ipc
    df = DataFrame._from_arrow(tbl, rechunk=rechunk)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/internals/dataframe/frame.py", line 470, in _from_arrow
    return cls._from_pydf(arrow_to_pydf(data, columns=columns, rechunk=rechunk))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/internals/construction.py", line 936, in arrow_to_pydf
    column = coerce_arrow(column)
             ^^^^^^^^^^^^^^^^^^^^
  File "polars/internals/construction.py", line 1105, in coerce_arrow
    array = pa.compute.cast(
            ^^^^^^^^^^
  File "polars/dependencies.py", line 82, in __getattr__
    return getattr(module, attr)
           ^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/__init__.py", line 335, in __getattr__
    raise AttributeError(
AttributeError: module 'pyarrow' has no attribute 'compute'

josh, Jan 21 '23 04:01

@alexander-beedie could you take a look if this still makes sense regarding the lazy loading?

ritchie46, Jan 21 '23 13:01

> @alexander-beedie could you take a look if this still makes sense regarding the lazy loading?

No problem; I have a block of time tomorrow afternoon 👍

alexander-beedie, Jan 21 '23 13:01

> could you take a look if this still makes sense regarding the lazy loading?

I guess another option would be just putting the import pyarrow.compute right inline in the coerce_arrow body, since it's only ever used there.

josh, Jan 21 '23 18:01

> > could you take a look if this still makes sense regarding the lazy loading?
>
> I guess another option would be just putting the import pyarrow.compute right inline in the coerce_arrow body, since it's only ever used there.

I like that more. Could you make this change?

ritchie46, Jan 21 '23 18:01

All looks good to me; does seem that pyarrow wants that explicitly imported, but unless we're going to have more than one such import I think it's fine to special-case it and import inline.

alexander-beedie, Jan 22 '23 07:01
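
For readers following along, the inline-import approach agreed on above might look roughly like the sketch below. It is an analogy built on the standard library rather than the actual polars change: xml stands in for the lazily loaded pa, xml.etree.ElementTree stands in for pyarrow.compute, and count_children is a hypothetical function playing the role of coerce_arrow.

```python
import xml  # root package only; stands in for the lazily loaded `pa` in this sketch


def count_children(document: str) -> int:
    """Hypothetical stand-in for coerce_arrow: the only user of the submodule."""
    # Inline import: guarantees the submodule is loaded and bound on the root
    # package before the attribute chain below is evaluated.
    import xml.etree.ElementTree

    # len() of an Element is its number of direct children.
    return len(xml.etree.ElementTree.fromstring(document))


n_children = count_children("<a><b/><c/></a>")
```

Repeated calls stay cheap because Python caches imported modules in sys.modules, so the inline import only does real work the first time.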

Great! Thanks @josh and @alexander-beedie

ritchie46, Jan 22 '23 08:01