datatable fread string with NAs generates extra distinct group

Open jangorecki opened this issue 4 years ago • 0 comments

wget https://raw.githubusercontent.com/h2oai/db-benchmark/cf255c174647ac437aa7a85751f6e65732a3cb9a/_data/groupby-datagen.R
Rscript groupby-datagen.R 1e9 1e2 5 0

## activate your pydt env
source ~/git/db-benchmark/pydatatable/py-pydatatable/bin/activate
python

import datatable as dt
from datatable import f, count, isna
x = dt.fread('G1_1e9_1e2_5_0.csv', na_strings=[''])
print(x.nrows, flush=True)
#1000000000
# rows where id1 was read as an empty string rather than NA
x[f.id1=="", count()]
#   |   count
#-- + -------
# 0 | 2501132
#
#[1 row x 1 column]
# rows where id1 was read as NA
x[isna(f.id1), count()]
#   |    count
#-- + --------
# 0 | 47505964
#
#[1 row x 1 column]

## back in the shell: count lines whose first field (id1) is empty
grep "^," G1_1e9_1e2_5_0.csv | wc -l
#50007096
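
For anyone who wants to cross-check the grep count without leaving Python, here is a minimal sketch using only the standard library. The helper name `count_empty_first_field` and the tiny in-memory sample are hypothetical and only stand in for the real file; the sketch also assumes the CSV has a header row and that id1 is the first column. It finishes by checking that the two datatable counts reported above add up to the grep total.

```python
import csv
import io


def count_empty_first_field(lines):
    """Count data rows whose first field is empty (same idea as grep "^,")."""
    reader = csv.reader(lines)
    next(reader, None)  # assumption: the first row is a header, so skip it
    return sum(1 for row in reader if row and row[0] == "")


# Hypothetical sample rows standing in for G1_1e9_1e2_5_0.csv.
sample = io.StringIO("id1,id2,v1\nid001,id002,1\n,id003,2\n,id004,3\nid005,,4\n")
empty_rows = count_empty_first_field(sample)
print(empty_rows)  # 2

# Counts copied from the output above: empty strings + NAs vs. the grep total.
reported_empty, reported_na, reported_grep = 2501132, 47505964, 50007096
print(reported_empty + reported_na == reported_grep)  # True
```

On the real file, `count_empty_first_field(open('G1_1e9_1e2_5_0.csv', newline=''))` should agree with the grep count, although it will be much slower than grep at this size.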

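The two counts above add up to the grep total (2501132 + 47505964 = 50007096), so about 2.5 million empty id1 fields were read as empty strings instead of NA, and that produces the extra distinct group.

Note that this issue manifests only at the 1e9 data size. At smaller sizes the number of distinct groups for this column is exactly as expected.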

What was the expected behavior?

NAs are read properly: with na_strings=[''], every empty id1 field should come back as NA.

Your environment?

pydt 9bc7d05db2e35a480fb9aea7b570c1005776ae4d, Python 3.6.7, Ubuntu 16.04

Dec 15 '20 11:12 jangorecki
