
Handler does not process large integers correctly

Open · pieths opened this issue 5 years ago • 1 comment

import numpy as np
import pandas as pd
from nimbusml import FileDataStream
from nimbusml.preprocessing.missing_values import Handler

# One less than the maximum positive int32 value (2147483647).
# See: https://docs.scipy.org/doc/numpy/user/basics.types.html
large_int = 2147483646
# The column contains no missing values, so Handler is expected to leave it unchanged.
with_nans = pd.DataFrame(data=dict(c1=[3, large_int, 5, 4])).astype(np.int32)

nahandle = Handler(replace_with='Mean') << 'c1'

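result = nahandle.fit_transform(with_nans)
# Cast back to int32 so the output can be compared with the input.
result = result.astype(np.int32)

print(result)
print(result.dtypes)
# Expected: 2147483646, then True.
print(result.loc[1, 'c1.c1'])
print(result.loc[1, 'c1.c1'] == large_int)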

The last line prints False, and the value returned is -2147483648 instead of 2147483646.

This does work if the number is small enough to be represented exactly as a float32 (e.g. large_int = 21474836).

This looks like it fails because the Handler transform implicitly converts its inputs to float32. Any value that cannot be represented exactly as a float32 will not round-trip correctly through this transform.
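The float32 explanation can be checked without NimbusML. The following is a minimal sketch using only NumPy. The helper names `float32_round_trip` and `survives_float32` are hypothetical and exist only for this illustration. It assumes, as suggested above, that the transform stores values as float32 internally, and it does not reproduce what Handler actually does.

```python
import numpy as np


def float32_round_trip(value: int) -> int:
    """Return the integer obtained after storing `value` as a float32.

    Hypothetical helper for illustration; not part of NimbusML.
    """
    return int(np.float32(value))


def survives_float32(values) -> bool:
    """Check whether every integer in `values` is exactly representable as float32.

    int64 is used for the comparison so that a rounded-up value such as
    2**31 does not overflow during the check itself.
    """
    arr = np.asarray(values, dtype=np.int64)
    return bool(np.array_equal(arr.astype(np.float32).astype(np.int64), arr))


large_int = 2147483646  # value from the report
small_int = 21474836    # value reported to work

# float32 has a 24-bit significand, so near 2**31 only multiples of 128
# are representable; 2147483646 rounds up to 2147483648 (2**31).
print(float32_round_trip(large_int))
print(float32_round_trip(small_int))
print(survives_float32([3, large_int, 5, 4]))
print(survives_float32([3, small_int, 5, 4]))
```

The rounded value 2147483648 is one above the int32 maximum, so casting it back to int32 overflows, which is consistent with the -2147483648 seen in the output. A check like `survives_float32` could be run on integer columns before applying the transform to detect values that would be altered.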

pieths · Sep 23 '19 20:09

Hi! I’m new to open source and I’d like to take on this task along with #269 over the next couple of weeks. Is that alright?

pnshinde · Nov 18 '19 02:11