ibis bug: different treatment of int columns in Pandas vs. Pyspark backend

Open emilyreff7 opened this issue 5 years ago • 1 comment

If I create an ibis TableExpr and mutate it with an IntegerColumn (dtype=int8), the dtype of the resulting materialized pandas DataFrame differs between the Pandas backend and the Pyspark backend. For example:

    import ibis
    import pandas as pd

    df = pd.DataFrame({'value': [1, 2, 3]})
    client = ibis.pandas.connect({'table': df})
    table = client.table('table')
    expr = table.mutate(v=7)
    expr.schema()  # prints ibis.Schema({value int64, v int8})
    result = expr.execute()
    result.dtypes  # prints value int64, v int64

That is, the dtype=int8 of the assigned IntegerColumn is ultimately converted to int64 in the resulting DataFrame.
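
The int8 in the schema would be explained by the literal 7 being typed as the narrowest integer type that can hold it. The following is a minimal sketch of that idea, assuming narrowest-fit inference over signed widths; `smallest_int_dtype` is a hypothetical helper written for illustration and is not part of ibis.

```python
# Illustrative only: this is not ibis's actual implementation.
SIGNED_INT_BITS = (8, 16, 32, 64)


def smallest_int_dtype(value: int) -> str:
    """Return the name of the narrowest signed integer dtype that can hold value."""
    for bits in SIGNED_INT_BITS:
        lo = -(2 ** (bits - 1))
        hi = 2 ** (bits - 1) - 1
        if lo <= value <= hi:
            return f"int{bits}"
    raise OverflowError(f"{value} does not fit in a signed 64-bit integer")


print(smallest_int_dtype(7))    # int8
print(smallest_int_dtype(300))  # int16
```

Under this assumption, the expression-level type (int8) is narrower than what either backend materializes, which is where the two backends diverge.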

Executing the same on the Pyspark backend, and then calling .toPandas() on the result, gives dtype=int32 for column v.

It seems there should be a consistent dtype returned for column v between the two execution backends.
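
One way to make the inconsistency concrete is to compare the dtypes each backend returns against the expression schema. This is a rough sketch in plain Python; `dtype_mismatches` is a hypothetical helper, and the dtype strings below are the ones reported above for column v (in practice they would be collected from `expr.schema()` and `result.dtypes`).

```python
def dtype_mismatches(expected, observed_by_backend):
    """Return {column: {backend: dtype}} for columns where any backend differs from expected."""
    mismatches = {}
    for column, want in expected.items():
        # Collect what each backend actually returned for this column.
        got = {
            backend: dtypes[column]
            for backend, dtypes in observed_by_backend.items()
            if column in dtypes
        }
        if any(dtype != want for dtype in got.values()):
            mismatches[column] = got
    return mismatches


# Values taken from this report: schema says int8, backends disagree.
expected = {"v": "int8"}
observed = {"pandas": {"v": "int64"}, "pyspark": {"v": "int32"}}
print(dtype_mismatches(expected, observed))
# {'v': {'pandas': 'int64', 'pyspark': 'int32'}}
```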

Jan 30 '20 22:01 emilyreff7

Addressing this is a breaking change, since the results of .execute() will change. Moving to 4.0.0.

Jul 01 '22 16:07 cpcloud
