fread: string column with NAs generates an extra distinct group
wget https://raw.githubusercontent.com/h2oai/db-benchmark/cf255c174647ac437aa7a85751f6e65732a3cb9a/_data/groupby-datagen.R
Rscript groupby-datagen.R 1e9 1e2 5 0
## activate your pydt env
source ~/git/db-benchmark/pydatatable/py-pydatatable/bin/activate
python
import datatable as dt
from datatable import f, count, isna
x = dt.fread('G1_1e9_1e2_5_0.csv', na_strings=[''])
print(x.nrows, flush=True)
#1000000000
x[f.id1=="", count()]
# | count
#-- + -------
# 0 | 2501132
#
#[1 row x 1 column]
x[isna(f.id1), count()]
# | count
#-- + --------
# 0 | 47505964
#
#[1 row x 1 column]
grep "^," G1_1e9_1e2_5_0.csv | wc -l
#50007096
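Note that this issue manifests only at the 1e9 data size, not at smaller sizes. Smaller sizes have exactly the same number of distinct groups for this column. In the output above, the two counts sum to the grep total (2501132 + 47505964 = 50007096), so every row with an empty id1 was read either as an empty string or as NA, and the empty-string rows form the extra group.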
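For anyone who wants to cross-check the counts without going through datatable, here is a minimal sketch using only the Python standard library. The helper names `count_empty_first_field` and `counts_add_up` are hypothetical, not part of datatable, and the sketch assumes the first line of the file is a header and that id1 is the first column, as in the generated CSV.

```python
import csv

def count_empty_first_field(path):
    """Count data rows whose first CSV field is empty (similar to grep "^,").

    Hypothetical helper: assumes the first line is a header and skips it.
    """
    empty = 0
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        for row in reader:
            # A blank line yields an empty list, so guard before indexing.
            if row and row[0] == "":
                empty += 1
    return empty

def counts_add_up(empty_string_rows, na_rows, raw_empty_rows):
    """True if the rows read as "" plus the rows read as NA account for
    every row whose first field is empty in the raw file."""
    return empty_string_rows + na_rows == raw_empty_rows

# Numbers taken from the outputs reported in this issue.
print(counts_add_up(2501132, 47505964, 50007096))  # True
```

On the 1e9 file, `count_empty_first_field('G1_1e9_1e2_5_0.csv')` should agree with the grep count, although a pure-Python pass over a file of that size will be slow.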
- What was the expected behavior?
All empty strings in id1 should be read as NA, given na_strings=[''], so that x[f.id1=="", count()] returns 0.
- Your environment?
pydt 9bc7d05db2e35a480fb9aea7b570c1005776ae4d, Python 3.6.7, Ubuntu 16.04