NimbusML
Handler does not process large integers correctly
import numpy as np
import pandas as pd
from nimbusml import FileDataStream
from nimbusml.preprocessing.missing_values import Handler
# 1 less than the maximum positive int32 value
# See: https://docs.scipy.org/doc/numpy/user/basics.types.html
large_int = 2147483646
with_nans = pd.DataFrame(data=dict( c1=[3, large_int, 5, 4])).astype(np.int32)
nahandle = Handler(replace_with='Mean') << 'c1'
result = nahandle.fit_transform(with_nans)
result = result.astype(np.int32)
print(result)
print(result.dtypes)
print(result.loc[1, 'c1.c1'])
print(result.loc[1, 'c1.c1'] == large_int)
The last line prints False, and the value returned is -2147483648.
The example does work if the number is small enough to be represented exactly in a float32 (e.g. large_int = 21474836).
This looks like it fails because the Handler transform implicitly converts its inputs into float32. Any value that cannot be represented exactly in float32 will not round-trip correctly through this transform.
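The rounding can be reproduced with NumPy alone, without NimbusML. The sketch below is only an illustration of the suspected cause, not a description of what Handler does internally. The variable names are made up for the example.

```python
import numpy as np

large_int = 2147483646  # 1 less than the maximum positive int32 value
small_int = 21474836    # the smaller value that works in the report

# float32 has a 24-bit significand, so near 2**31 it can only hold
# multiples of 128 (just below 2**31) or 256 (at and above 2**31).
f32_value = int(np.float32(large_int))        # convert back via a Python int
small_roundtrip = int(np.float32(small_int))  # exactly representable
f64_value = int(np.float64(large_int))        # float64 has a 53-bit significand

print(f32_value)                              # 2147483648
print(f32_value > np.iinfo(np.int32).max)     # True
print(small_roundtrip == small_int)           # True
print(f64_value == large_int)                 # True
```

The float32 value is rounded up to 2147483648, which is one more than the int32 maximum. Casting that back to int32 overflows, and the result is platform dependent; on many platforms it comes out as -2147483648, which matches the value reported above.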
Hi! I’m new to open source and I’d like to take on this task along with #269 over the next couple of weeks. Is that alright?