ibis
bug: different treatment of int columns in Pandas vs. PySpark backend
If I create an ibis TableExpr and mutate it with an IntegerColumn (dtype=int8), the dtype of that column in the materialized pandas DataFrame differs between the Pandas backend and the PySpark backend. For example:
import ibis
import pandas as pd
df = pd.DataFrame({'value': [1, 2, 3]})
client = ibis.pandas.connect({'table': df})
table = client.table('table')
expr = table.mutate(v=7)
expr.schema()  # ibis.Schema({value int64, v int8})
result = expr.execute()
result.dtypes  # value int64, v int64
That is, the dtype=int8 of the assigned IntegerColumn is ultimately converted to int64 in the resulting DataFrame.
Executing the same expression on the PySpark backend, and then calling .toPandas() on the result, gives dtype=int32 for column v.
It seems there should be a consistent dtype returned for column v between the two execution backends.
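Until the backends agree, one possible user-side workaround is to cast the executed result to the dtypes the schema reports. The following is only a sketch using plain pandas: the helper `cast_to_expected` is a hypothetical name, the two DataFrames are stand-ins for what each backend returns as described above, and the `expected` dict is written by hand rather than derived from `expr.schema()`.

```python
import pandas as pd


def cast_to_expected(result, expected):
    """Cast the listed columns to the given dtypes (hypothetical helper)."""
    return result.astype(expected)


# Stand-in for the Pandas backend result: column v comes back as int64.
pandas_like = pd.DataFrame({"value": [1, 2, 3], "v": [7, 7, 7]}, dtype="int64")
# Stand-in for the PySpark .toPandas() result: column v comes back as int32.
spark_like = pandas_like.astype({"v": "int32"})

# Hand-written mapping mirroring the schema shown above (value int64, v int8).
expected = {"value": "int64", "v": "int8"}

from_pandas = cast_to_expected(pandas_like, expected)
from_spark = cast_to_expected(spark_like, expected)
```

After the cast, both frames carry the same dtypes regardless of which backend produced them. This does not fix the underlying inconsistency; it only hides it at the call site.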
Addressing this is a breaking change, since the results of .execute() will change. Moving to 4.0.0.