xgboost Bug: [2.1.0] DMatrix creation from Arrow-backed pandas Dataframes can trigger (ArrowInvalid: Zero copy conversions not possible with boolean types)

We have an issue upgrading XGBoost from 2.0.3 to 2.1.0. We use Arrow-backed types for our pandas DataFrames, and if there are boolean columns (1 bit per element, which saves memory in our pandas manipulations), we can't create a DMatrix from the DataFrame.

Repro code:


import pandas as pd
import pyarrow as pa
import xgboost

tab1 = pa.table({"a": pa.array([1, 2, 3, 4]), "b": [True, False, False, True]})
df1 = tab1.to_pandas(types_mapper=pd.ArrowDtype)

xgboost.DMatrix(df1)

In 2.0.3, this constructs a DMatrix, but in 2.1.0 it raises an error (last few frames of the stack trace):

    605     raise ValueError(f"DataFrame for {meta} cannot have multiple columns")
    607 feature_names, feature_types = pandas_feature_info(
    608     data, meta, feature_names, feature_types, enable_categorical
    609 )
--> 611 arrays = pandas_transform_data(data)
    612 return PandasTransformed(arrays), feature_names, feature_types

File <install_path>/python3.11/site-packages/xgboost/data.py:540, in pandas_transform_data(data)
    538     result.append(cat_codes(data[col]))
    539 elif is_pa_ext_dtype(dtype):
--> 540     result.append(pandas_pa_type(data[col]))
    541 elif is_nullable_dtype(dtype):
    542     result.append(nu_type(data[col]))

File <install_path>/python3.11/site-packages/xgboost/data.py:468, in pandas_pa_type(ser)
    461 zero_copy = chunk.null_count == 0
    462 # Alternately, we can use chunk.buffers(), which returns a list of buffers and
    463 # we need to concatenate them ourselves.
    464 # FIXME(jiamingy): Is there a better way to access the arrow buffer along with
    465 # its mask?
    466 # Buffers from chunk.buffers() have the address attribute, but don't expose the
    467 # mask.
--> 468 arr: np.ndarray = chunk.to_numpy(zero_copy_only=zero_copy, writable=False)
    469 arr, _ = _ensure_np_dtype(arr, arr.dtype)
    470 return arr

File <install_path>/python3.11/site-packages/pyarrow/array.pxi:1587, in pyarrow.lib.Array.to_numpy()

File <install_path>/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Zero copy conversions not possible with boolean types

Environment:

    Python  : 3.11.6
    OS      : Darwin
    machine : arm64
    pandas  : 2.2.2
    numpy   : 1.24.4
    pyarrow : 16.1.0

Jun 29 '24 17:06 cvm-a

If I try to create a DMatrix directly from the pyarrow table, I get the same "ArrowInvalid: Zero copy conversions not possible with boolean types" error in 2.1.0, but in 2.0.3 I get a different error:

File <installpath>/python3.11/site-packages/xgboost/data.py:1118, in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical, data_split_mode)
   1114     return _from_pandas_series(
   1115         data, missing, threads, enable_categorical, feature_names, feature_types
   1116     )
   1117 if _is_arrow(data):
-> 1118     return _from_arrow(
   1119         data, missing, threads, feature_names, feature_types, enable_categorical
   1120     )
   1121 if _has_array_protocol(data):
   1122     array = np.asarray(data)

File <installpath>/python3.11/site-packages/xgboost/data.py:737, in _from_arrow(data, missing, nthread, feature_names, feature_types, enable_categorical)
    732 import pyarrow as pa
    734 if not all(
    735     pa.types.is_integer(t) or pa.types.is_floating(t) for t in data.schema.types
    736 ):
--> 737     raise ValueError(
    738         "Features in dataset can only be integers or floating point number"
    739     )
    740 if enable_categorical:
    741     raise ValueError("categorical data in arrow is not supported yet.")

ValueError: Features in dataset can only be integers or floating point number

PyArrow booleans are bit-packed, so they need to be unpacked from 1-bit values to 1-byte values before NumPy can use them, which rules out a zero-copy conversion.

Jun 29 '24 17:06 cvm-a

Thank you for sharing.

This test is not doing what it's supposed to be doing:

https://github.com/dmlc/xgboost/blob/09d32f1f2b37e13f0d56e025ccecddf1fa9db76d/python-package/xgboost/testing/data.py#L206
https://github.com/dmlc/xgboost/blob/09d32f1f2b37e13f0d56e025ccecddf1fa9db76d/tests/python/test_with_pandas.py#L512

Jun 30 '24 07:06 trivialfis