HVDM in KNN
Hello!
I'm trying to use the HVDM metric with sklearn's KNN, but it keeps failing in the fitting step. I've tried printing things inside the code to understand what's going on, but that only raised new questions.
I'm working on the "heart" dataset, doing some feature selection, and looking at the model's accuracy as the number of features increases. So I initialize the metric as follows:
```python
hvdm_metric = HVDM(np.concatenate((X[:, features_subset], np.array([Y]).T), axis=1),
                   [len(features_subset)], ind_map, nan_equivalents=[nan_eqv])
```
(So I first concatenate the data and the output, since HVDM needs the output inside the data; ind_map is just the mapping of the categorical features onto the feature subset.)
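For concreteness, here is the same initialization with made-up stand-ins for my real variables (the shapes match my script, but all the values are hypothetical):

```python
import numpy as np
from distython import HVDM

X = np.random.randn(100, 5)        # made-up numeric data
Y = np.random.randint(0, 2, 100)   # made-up categorical output
features_subset = [0, 2, 3]        # currently selected columns
ind_map = [1]                      # categorical positions *within* the subset
nan_eqv = 12345                    # sentinel standing in for missing values

# HVDM needs the output inside the data, so it goes in as the last column.
data = np.concatenate((X[:, features_subset], np.array([Y]).T), axis=1)
hvdm_metric = HVDM(data, [len(features_subset)], ind_map, nan_equivalents=[nan_eqv])
```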
Then I get an error when using `nn.fit(X_T[:, features_subset], Y_T)`: it says it tries to divide by 0. From everything I had tried before, I thought it came from a numerical feature wrongly flagged as categorical, but unfortunately that's not it. It tries to compute the distance from each of my outputs to the mean of the outputs, but the output is categorical, so that's really weird. I guess it comes from the fact that my output Y is included in the data when I initialize the metric...
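To make that worry concrete, continuing the snippet above (just illustrating the mismatch I suspect, not a confirmed diagnosis):

```python
# The metric was initialized on len(features_subset) + 1 columns...
print(data.shape)                    # (100, 4), label included as last column

# ...but fit() only ever hands the metric rows with len(features_subset) columns.
print(X[:, features_subset].shape)   # (100, 3)
```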
Do you have any working example of using HVDM with KNN? Or any idea of how to avoid this? Thanks a lot for your package, and for introducing me to these metrics!
Hi, thanks for submitting the issue. Could you attach the output log so I can look at it? If you could also provide a small script that reproduces the error with a data sample, that would be great!
I am planning to refurbish Distython a bit in a month or so, after my dissertation and exams :) I am happy to put it on my to-do list as soon as we figure out what's going on.
Just so you know, I use a KNN classifier here. I do not call the distython functions directly, because I want to apply a configurable factor to the missing-data result rather than always use 1, but otherwise the function is just the same (I sketch the change after the log). Here is the log from my script:
```
Division by zero is not allowed!
Traceback (most recent call last):
  File "D:\Bureau\Ecole\Mémoire\Master-thesis\Dourt\test_heart_hvdm.py", line 104, in <module>
  File "C:\Users\nicol\Anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 1154, in fit
    return self._fit(X)
  File "C:\Users\nicol\Anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 452, in _fit
    **self.effective_metric_params_)
  File "sklearn\neighbors\_binary_tree.pxi", line 1106, in sklearn.neighbors._ball_tree.BinaryTree.__init__
  File "sklearn\neighbors\_binary_tree.pxi", line 1239, in sklearn.neighbors._ball_tree.BinaryTree._recursive_build
  File "sklearn\neighbors\_ball_tree.pyx", line 98, in sklearn.neighbors._ball_tree.init_node
  File "sklearn\neighbors\_binary_tree.pxi", line 1217, in sklearn.neighbors._ball_tree.BinaryTree.rdist
  File "sklearn\neighbors\_dist_metrics.pyx", line 312, in sklearn.neighbors._dist_metrics.DistanceMetric.rdist
  File "sklearn\neighbors\_dist_metrics.pyx", line 1108, in sklearn.neighbors._dist_metrics.PyFuncDistance.dist
  File "sklearn\neighbors\_dist_metrics.pyx", line 1116, in sklearn.neighbors._dist_metrics.PyFuncDistance._dist
  File "../utils\HVDM_distython.py", line 76, in hvdm
    results_array[cat_ix] = super().vdm(x, y, nan_ix)[cat_ix]
  File "../utils\VDM_distython.py", line 103, in vdm
    result[i] = temp_result
UnboundLocalError: local variable 'temp_result' referenced before assignment
```
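For reference, the change I mentioned only touches the missing-value case. A simplified sketch of a single numeric-attribute term (my paraphrase of the idea, not the actual distython code; NAN_FACTOR is a made-up value):

```python
NAN_EQV = 12345    # sentinel I use for missing values
NAN_FACTOR = 0.5   # made-up: the factor I apply instead of the fixed 1

def numeric_term(xa, ya, std_a):
    """One per-attribute HVDM term for a numeric attribute.

    Standard HVDM (and, as far as I can tell, distython) returns 1 whenever
    either value is missing; my version returns NAN_FACTOR instead.
    """
    if xa == NAN_EQV or ya == NAN_EQV:
        return NAN_FACTOR
    return abs(xa - ya) / (4 * std_a)  # HVDM's normalized difference
```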
Now, for a small data sample and a basic script:
```python
import numpy as np
from distython import HVDM
from sklearn.neighbors import KNeighborsClassifier

x = np.random.randn(50)
y = (np.cos(x) < 0.5) * 1 + (np.cos(x) < 0.8) * 1
rx = np.round(x)
data = np.array([x, rx, y]).T

hvdm_metric = HVDM(data, [2], [1], nan_equivalents=[1234567])

train = range(30)
nn = KNeighborsClassifier(n_neighbors=5, metric=hvdm_metric.hvdm)
nn.fit(data[train, 0:1], data[train, 2])
```
So I have a random vector x, I round it to get a categorical feature rx (so we have two features), and I build a categorical output y from x. Then I try to train the KNN model; here's the log from fit:
```
runfile('D:/Bureau/Ecole/Mémoire/Master-thesis/Dourt/sanstitre0.py', wdir='D:/Bureau/Ecole/Mémoire/Master-thesis/Dourt')
Traceback (most recent call last):
  File "D:\Bureau\Ecole\Mémoire\Master-thesis\Dourt\sanstitre0.py", line 21, in <module>
  File "C:\Users\nicol\Anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 1154, in fit
    return self._fit(X)
  File "C:\Users\nicol\Anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 452, in _fit
    **self.effective_metric_params_)
  File "sklearn\neighbors\_binary_tree.pxi", line 1106, in sklearn.neighbors._ball_tree.BinaryTree.__init__
  File "sklearn\neighbors\_binary_tree.pxi", line 1239, in sklearn.neighbors._ball_tree.BinaryTree._recursive_build
  File "sklearn\neighbors\_ball_tree.pyx", line 98, in sklearn.neighbors._ball_tree.init_node
  File "sklearn\neighbors\_binary_tree.pxi", line 1217, in sklearn.neighbors._ball_tree.BinaryTree.rdist
  File "sklearn\neighbors\_dist_metrics.pyx", line 312, in sklearn.neighbors._dist_metrics.DistanceMetric.rdist
  File "sklearn\neighbors\_dist_metrics.pyx", line 1108, in sklearn.neighbors._dist_metrics.PyFuncDistance.dist
  File "sklearn\neighbors\_dist_metrics.pyx", line 1116, in sklearn.neighbors._dist_metrics.PyFuncDistance._dist
  File "C:\Users\nicol\Anaconda3\lib\site-packages\distython\HVDM.py", line 80, in hvdm
    results_array[cat_ix] = super().vdm(x, y, nan_ix)[cat_ix]
  File "C:\Users\nicol\Anaconda3\lib\site-packages\distython\VDM.py", line 89, in vdm
    x_ix = np.argwhere(self.unique_attributes[:, i] == x[i]).flatten()
IndexError: index 1 is out of bounds for axis 0 with size 1
```
I have also tried `nn.fit(data[train, :], data[train, 2])`, and in that case the error is:

```
File "C:\Users\nicol\Anaconda3\lib\site-packages\distython\VDM.py", line 101, in vdm
    result[i] = temp_result
UnboundLocalError: local variable 'temp_result' referenced before assignment
```

It's the same error as with my more complete script.
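In case it helps, my reading of the traceback is that temp_result only gets assigned when x[i] matches one of the values recorded at initialization time, so an unseen value leaves it unbound. Here is a standalone toy of that failure mode (my guess at the pattern, not the actual VDM.py code):

```python
import numpy as np

# Unique attribute values recorded when the metric was initialized (toy data).
unique_attributes = np.array([[0.0], [1.0]])

def toy_vdm(x, i=0):
    x_ix = np.argwhere(unique_attributes[:, i] == x[i]).flatten()
    if x_ix.size > 0:         # only runs if x[i] was seen at init time
        temp_result = 0.0     # stand-in for the real conditional-probability math
    result = temp_result      # raises UnboundLocalError when x[i] is unseen
    return result

toy_vdm(np.array([2.5]))  # 2.5 was never seen -> same UnboundLocalError as my log
```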
Thanks a lot for helping me, and good luck with your exams; I'm in the same boat!