
Predicates incorrectly keep missing values for `Float64` and `Int64` dtypes with pyarrow=4

mlondschien opened this issue on Jun 25, 2021 · 1 comment

With nullable `Int64` and `Float64` columns, kartothek's `read_table` with an `==` predicate returns the `<NA>` row in addition to the matching row, while `pyarrow.parquet.read_table` with the equivalent `filters` drops it. Plain `float64` and `string` columns are included below for comparison.

In [1]: from functools import partial
   ...: 
   ...: import minimalkv
   ...: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "I": pd.array([0, 1, pd.NA], dtype="Int64"),
   ...:         "f": pd.array([0.0, 1.1, np.nan], dtype="float64"),
   ...:         "F": pd.array([0.0, 1.1, pd.NA], dtype="Float64"),
   ...:         "o_1": pd.array([0, 1, None], dtype="object"),
   ...:         "o_2": pd.array(["0", "1", None], dtype="object"),
   ...:         "s": pd.array(["0", "b", None], dtype="string"),
   ...:     }
   ...: )
   ...: df.dtypes
Out[1]: 
I        Int64
f      float64
F      Float64
o_1     object
o_2     object
s       string
dtype: object

In [2]: df.to_parquet("/tmp/file.parquet")
   ...: pa.parquet.read_table("/tmp/file.parquet").to_pandas()
Out[2]: 
      I    f     F  o_1   o_2     s
0     0  0.0   0.0  0.0     0     0
1     1  1.1   1.1  1.0     1     b
2  <NA>  NaN  <NA>  NaN  None  <NA>

In [3]: store = partial(minimalkv.get_store_from_url, f"hfs:///tmp?create_if_missing=False")
   ...: store_dataframes_as_dataset(dfs=[df], dataset_uuid="test", store=store, overwrite=True)
   ...: read_table(dataset_uuid="test", store=store)
Out[3]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1   1.1     1  1.1  1.0     1     b
2  <NA>  <NA>  NaN  NaN  None  <NA>

In [4]: # Int64
   ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "!=", 0)]]).to_pandas()
Out[4]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [5]: read_table(dataset_uuid="test", store=store, predicates=[[("I", "!=", 0)]])
Out[5]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [6]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "==", 0)]]).to_pandas()
Out[6]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [7]: read_table(dataset_uuid="test", store=store, predicates=[[("I", "==", 0)]])
Out[7]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1  <NA>  <NA>  NaN  NaN  None  <NA>
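For reference, here is a minimal sketch (not part of the original report) of the null handling one would expect, using pyarrow's compute module as of pyarrow 4: comparing against a null yields null, and the filter kernel drops null mask entries by default, which is consistent with `pyarrow.parquet.read_table` excluding the `<NA>` row above.

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([0, 1, None], type=pa.int64())
mask = pc.equal(arr, 0)   # -> [true, false, null]
arr.filter(mask)          # null mask entries are dropped by default -> [0]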

In [8]: # Float64
   ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("F", "!=", 0.0)]]).to_pandas()
Out[8]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [10]: read_table(dataset_uuid="test", store=store, predicates=[[("F", "!=", 0.0)]])
Out[10]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [11]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("F", "==", 0.0)]]).to_pandas()
Out[11]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [12]: read_table(dataset_uuid="test", store=store, predicates=[[("F", "==", 0.0)]])
Out[12]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1  <NA>  <NA>  NaN  NaN  None  <NA>

In [15]: # float64
    ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("f", "!=", 0.0)]]).to_pandas()
Out[15]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [16]: read_table(dataset_uuid="test", store=store, predicates=[[("f", "!=", 0.0)]])
Out[16]: 
      F     I    f  o_1   o_2     s
0   1.1     1  1.1  1.0     1     b
1  <NA>  <NA>  NaN  NaN  None  <NA>

In [17]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("f", "==", 0.0)]]).to_pandas()
Out[17]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [18]: read_table(dataset_uuid="test", store=store, predicates=[[("f", "==", 0.0)]])
Out[18]: 
     F  I    f  o_1 o_2  s
0  0.0  0  0.0  0.0   0  0

In [19]: # string
    ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("s", "!=", "0")]]).to_pandas()
Out[19]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [20]: read_table(dataset_uuid="test", store=store, predicates=[[("s", "!=", "0")]])
Out[20]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [21]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("s", "==", "0")]]).to_pandas()
Out[21]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [22]: read_table(dataset_uuid="test", store=store, predicates=[[("s", "==", "0")]])
Out[22]: 
     F  I    f  o_1 o_2  s
0  0.0  0  0.0  0.0   0  0
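A possible regression test, sketched with the eager API from the reproduction above (the `tmp_path` fixture is pytest's built-in temporary directory; everything else mirrors the session):

from functools import partial

import minimalkv
import pandas as pd
from kartothek.io.eager import read_table, store_dataframes_as_dataset


def test_eq_predicate_drops_missing_values(tmp_path):
    # Nullable columns with one missing value each.
    df = pd.DataFrame(
        {
            "I": pd.array([0, 1, pd.NA], dtype="Int64"),
            "F": pd.array([0.0, 1.1, pd.NA], dtype="Float64"),
        }
    )
    store = partial(minimalkv.get_store_from_url, f"hfs://{tmp_path}")
    store_dataframes_as_dataset(dfs=[df], dataset_uuid="test", store=store)

    for column, value in [("I", 0), ("F", 0.0)]:
        result = read_table(
            dataset_uuid="test", store=store, predicates=[[(column, "==", value)]]
        )
        # Only the matching row may come back; the <NA> row must be dropped.
        assert len(result) == 1
        assert not result[column].isna().any()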
