
Feature Request - Handling of NaNs

Open shaye059 opened this issue 3 years ago • 0 comments

Currently, any NaN value in the numpy array leads to the following error when trying to build an RCTree:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-8f8be9d6cf46> in <module>
----> 1 tree = rrcf.RCTree(data_anom.sample(1000, random_state=111).to_numpy())

~\anaconda3\envs\squarefeetenv\lib\site-packages\rrcf\rrcf.py in __init__(self, X, index_labels, precision, random_state)
    104             # Create RRC Tree
    105             S = np.ones(n, dtype=np.bool)
--> 106             self._mktree(X, S, N, I, parent=self)
    107             # Remove parent of root
    108             self.root.u = None

~\anaconda3\envs\squarefeetenv\lib\site-packages\rrcf\rrcf.py in _mktree(self, X, S, N, I, parent, side, depth)
    170         depth += 1
    171         # Create a cut according to definition 1
--> 172         S1, S2, branch = self._cut(X, S, parent=parent, side=side)
    173         # If S1 does not contain an isolated point...
    174         if S1.sum() > 1:

~\anaconda3\envs\squarefeetenv\lib\site-packages\rrcf\rrcf.py in _cut(self, X, S, parent, side)
    152         l /= l.sum()
    153         # Determine dimension to cut
--> 154         q = self.rng.choice(self.ndim, p=l)
    155         # Determine value for split
    156         p = self.rng.uniform(xmin[q], xmax[q])

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: probabilities contain NaN
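
The failure can be reproduced without rrcf. The snippet below is a minimal sketch that mirrors the steps visible in the traceback (`l /= l.sum()` followed by `rng.choice(..., p=l)`); it assumes that `l` is the per-column range `xmax - xmin`, which the traceback does not show directly.

```python
import numpy as np

# Small array with a single NaN in the second column.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Assumption: the cut probabilities come from the per-column range.
xmin = X.min(axis=0)   # min/max propagate NaN, so column 1 becomes NaN
xmax = X.max(axis=0)
l = xmax - xmin
l /= l.sum()           # the sum is NaN, so every entry of l is now NaN

rng = np.random.RandomState(0)
try:
    rng.choice(X.shape[1], p=l)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

A single NaN is enough: once the sum of the ranges is NaN, the normalization spreads it to every dimension.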

Filling NaNs with the column mean or median is probably the best way to handle this, so having it as a built-in option would be helpful. Maybe it could be an optional parameter when creating an RCTree, with the default handling set to None?
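
Until something like this exists, the imputation can be done before the array is passed to the tree. The helper below is only a sketch of the idea: `fill_nans` and its `strategy` parameter are hypothetical names and not part of rrcf.

```python
import numpy as np

def fill_nans(X, strategy="mean"):
    """Return a float copy of X with NaNs replaced by per-column statistics.

    Hypothetical helper, not part of rrcf. A column that is entirely NaN
    stays NaN (numpy emits a RuntimeWarning for it).
    """
    X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
    if strategy == "mean":
        fill = np.nanmean(X, axis=0)
    elif strategy == "median":
        fill = np.nanmedian(X, axis=0)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    # Locate the NaN cells and overwrite each with its column's fill value.
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_filled = fill_nans(X, strategy="mean")
print(X_filled)
```

The filled array can then be passed on as usual, for example `rrcf.RCTree(fill_nans(X))`.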

shaye059 • Mar 01 '21 15:03