pd.cut returning incorrect output in some cases
(Below is from master)
import numpy as np
import pandas as pd
arr = np.arange(10).astype(object)
arr[::2] = np.nan
print(arr)
# [nan 1 nan 3 nan 5 nan 7 nan 9]
result = pd.cut(arr, 2)
print(result)
# [NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0]]
# Categories (2, interval[float64]): [(0.992, 5.0] < (5.0, 9.0]]
print(result.unique())
# [NaN, (0.992, 5.0]]
# Categories (1, interval[float64]): [(0.992, 5.0]]
Using pd.cut on an object-dtype array containing missing values seems to return the wrong intervals in some cases (in the example above, only the first interval appears in the result, even though 7 and 9 belong in the second). So far, the only situation where I've been able to reproduce this is when the NaN values are evenly spaced, which is strange.
Looks like the problem is due to searchsorted in numpy:
import numpy as np
arr = np.array([1, 2, 3, 4, 5], dtype=object)
arr[::2] = np.nan
print(arr)
# [nan 2 nan 4 nan]
bins = np.array([1, 3, 5])
# Inserts into same position (incorrect)
bins.searchsorted(arr)
# array([0, 1, 0, 1, 0])
# Now inserts into different positions (correct)
bins.searchsorted(arr.astype(float))
# array([3, 1, 3, 2, 3])
np.__version__
# '1.17.5'
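Since the float version of searchsorted gives the right positions above, a possible workaround is to cast the object array to float before binning. This is only a sketch and assumes every element is either numeric or NaN, so the cast loses nothing:

```python
import numpy as np
import pandas as pd

arr = np.arange(10).astype(object)
arr[::2] = np.nan

# Assumption: all elements are numbers or NaN, so astype(float) is safe.
as_float = arr.astype(float)

result = pd.cut(as_float, 2)

# Integer codes per element: -1 marks NaN, 0 and 1 are the two bins.
print(result.codes)
print(result.categories)
```

With the float input, 7 and 9 should land in the second interval instead of the first.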
pd.cut is producing NaNs of float type. They test True with np.isnan(), but neither df.fillna() nor df.replace(float('nan'), replacement_value) can replace them!
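One plausible explanation for the fillna problem (an assumption, not confirmed in this thread) is that pd.cut returns categorical data, and a categorical can only be filled with a value that is already one of its categories. A minimal sketch, where the sample values and the "missing" label are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values.
values = pd.Series([np.nan, 1.0, np.nan, 3.0, 7.0])

# pd.cut on a Series returns a categorical Series; NaN inputs stay NaN.
binned = pd.cut(values, 2)

# Register the replacement label as a category first, then fill.
# "missing" is an illustrative label, not something pandas defines.
filled = binned.cat.add_categories("missing").fillna("missing")

print(filled)
```

If the interval categories are not needed afterwards, converting with binned.astype(object) before calling fillna is another option.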