
Data is being mapped into the complex plane

Open aadharna opened this issue 3 years ago • 10 comments

Is there a way to force the distance transforms that these algorithms learn not to map into the complex plane? It's also inconsistent: on some folds of tune_knn I have no problem, but sometimes the algorithm sends the data to C^N.

For several algorithms, I've received the following error message:

running CV on <class 'dml.anmm.ANMM'>
*** Tuning Case  {'num_dims': 2, 'n_friends': 1, 'n_enemies': 1} ...  
** FOLD  1  
** FOLD  2  
** FOLD  3  
Traceback (most recent call last):
  File "Perovskite_DistanceLearning.py", line 217, in <module>    mcml_results, mcml_best, mcml_best, mcml_detailed = tune_knn(ANMM,
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/tune.py", line 208, in tune_knn
    results = cross_validate(alg, X, y, n_folds=n_folds, n_reps=n_reps, verbose=verbose, seed=seed)  File "/home/jupyter/tacc-work/sd2nb/tune.py", line 82, in cross_validate
    alg.fit(X_train.real, y_train.real)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/pipeline.py", line 335,in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 1131, in fit
    X, y = self._validate_data(X, y, accept_sparse="csr",
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 796, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 608, in check_array
    _ensure_no_complex_data(array)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 394, in _ensure_no_complex_data
    raise ValueError("Complex data not supported\n"
ValueError: Complex data not supported
[[1.25992078e+04+0.j 4.53682237e+01+0.j]
 [1.25992083e+04+0.j 4.15542700e+01+0.j]
 [1.25992194e+04+0.j 1.35543520e+02+0.j]
 ...
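
Note that the .real stripping in cross_validate's fit call above cannot prevent this: the complex values are produced inside the pipeline by the learned transform, after the input has already been cleaned. As a stopgap, imaginary parts that are only floating-point noise could be stripped from the transformed data before it reaches scikit-learn. A minimal sketch with numpy (values taken from the log above; a workaround, not pyDML code):

import numpy as np

# If the imaginary parts are pure floating-point noise, real_if_close
# drops them and the array becomes real; if the data genuinely lives in
# C^N, the array stays complex and the assert fails.
Xt = np.array([[1.25992078e+04 + 0j, 4.53682237e+01 + 0j],
               [1.25992083e+04 + 0j, 4.15542700e+01 + 0j]])
Xt = np.real_if_close(Xt, tol=1000)
assert not np.iscomplexobj(Xt)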

aadharna avatar Jul 29 '20 23:07 aadharna

Hi, thank you for pointing this issue out. I'll get on it as soon as I can. Do you have a minimal working example that you are able to share to reproduce this error?

jlsuarezdiaz avatar Jul 30 '20 09:07 jlsuarezdiaz

The zip below contains three items.

  1. minimal57perovskite.csv: this is the full dataset. It gets loaded and then stratified, and the stratified version is the one I use.

  2. tune.py. I had to edit your tune file slightly because, in tune_knn, the following line was breaking:

ln 215: best_dml = dml(*(best_performance[0]), **dml_params)

I removed the dereference and the subscripting on best_performance to make it work. Previously, I kept getting an error that you cannot dereference an int (which makes sense, since that position of the best_performance tuple was an index pointing to the position of the best score). Perhaps you could explain this line, though, because it isn't clear to me why you'd want to pass the score into the algorithm constructor.

  3. Perovskite_DistanceLearning.py: this file loads the perovskite dataset and runs tune_knn on it.

minimalFailure.zip

aadharna avatar Jul 30 '20 15:07 aadharna

The fix commit automatically closes the issue, but feel free to ask for a reopen if there is still any problem. The problem was due to precision errors in a matrix that should be symmetric, which caused the eigenvalue decomposition to return complex values. I have fixed it and tested it with your code. Thank you for reporting this and providing an example.
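
For readers who land here with a similar symptom, the mechanism is easy to reproduce with plain numpy. A minimal sketch of the kind of fix described, forcing symmetry before the decomposition (an illustration, not pyDML's actual code):

import numpy as np

# Symmetric in exact arithmetic, but with simulated floating-point noise;
# the general solver np.linalg.eig may return complex eigenpairs for it.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T + 1e-12 * rng.standard_normal((5, 5))

# Symmetrize explicitly and use the symmetric solver, which is
# guaranteed to return real eigenvalues and eigenvectors.
M = (M + M.T) / 2
w, V = np.linalg.eigh(M)
assert not np.iscomplexobj(w) and not np.iscomplexobj(V)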

Regarding the issue you mentioned with the tune_knn function, it was already fixed in 5b7d57c, but the fix isn't released on PyPI yet. When the code was written, the latest version of Pandas was still 0.x, and in those versions the argmax function returned the identifier of the row with the highest score. That identifier was actually a dictionary with the parameters of the DML for that best score. After the 1.x update of Pandas, argmax started returning the positional (int) value of the row instead of the ID. To fix this, argmax was replaced by idxmax in the line you mentioned. As I said, it is fixed but unreleased; in the meantime you can download the source code and it should work fine. If there is any other problem, let me know.
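
The behaviour change is easy to see on a toy Series (an illustration, not the actual tune results table):

import pandas as pd

# Scores indexed by a label describing each parameter combination.
scores = pd.Series([0.81, 0.93, 0.88],
                   index=['dims=2', 'dims=3', 'dims=5'])

print(scores.idxmax())  # 'dims=3' -> the row label (old argmax behaviour)
print(scores.argmax())  # 1        -> positional int in pandas >= 1.0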

jlsuarezdiaz avatar Jul 30 '20 19:07 jlsuarezdiaz

ANMM is working now!

Will do. For now, the other immediate case I have is that NCMML is breaking. I do remember another algorithm also giving me the complex-projection error, but I can't recall which it was at the moment. When I do, I'll let you know.

The only change needed to reproduce the NCMML error is switching the search params in the already-linked Python file to:

tune_knn(...
         dml_params={'learning_rate': 'adaptive'},
         tune_args={'num_dims': [2, 3, 5, None],
                    'initial_transform': ['euclidean', 'scale']},

The error message here is:

[insert uncountably many "Accuracy softmax error. Recalculating" calls]
Accuracy softmax error. Recalculating.
Accuracy softmax error. Recalculating.
Traceback (most recent call last):
  File "Perovskite_DistanceLearning.py", line 116, in <module>
    mcml_results, mcml_best, mcml_best, mcml_detailed = tune_knn(NCMML,
  File "/home/jupyter/tacc-work/sd2nb/ext/pyDML/dml/tune.py", line 208, in tune_knn
    results = cross_validate(alg, X, y, n_folds=n_folds, n_reps=n_reps, verbose=verbose, seed=seed)
  File "/home/jupyter/tacc-work/sd2nb/ext/pyDML/dml/tune.py", line 82, in cross_validate
    alg.fit(X_train, y_train)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/pipeline.py", line 335,in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 1131, in fit
    X, y = self._validate_data(X, y, accept_sparse="csr",
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 796, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 645, in check_array
    _assert_all_finite(array,
  File "/home/jupyter/tacc-work/jupyter_packages/envs/distance/lib/python3.8/site-packages/sklearn/utils/validation.py", line 97, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

aadharna avatar Jul 30 '20 20:07 aadharna

I have fixed the "Accuracy error" spam message in NCMML and have also added some corrections to contain the overflow when possible. However, the algorithm will still overflow with your data. I think the initial learning rate is too high for your data and makes the values overflow in the first steps of the algorithm. You may want to try a much lower value of eta0 (perhaps <= 1e-6) to avoid the overflow.
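
A hypothetical re-run along those lines, fitting NCMML directly with the lower rate (iris as stand-in data; this assumes eta0 is accepted in the constructor as described):

from sklearn.datasets import load_iris
from dml import NCMML

X, y = load_iris(return_X_y=True)

# A much smaller initial learning rate, to keep the early iterations of
# the gradient descent from overflowing the softmax.
ncmml = NCMML(learning_rate='adaptive', eta0=1e-6)
ncmml.fit(X, y)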

Later on I will make sure that, when an overflow occurs, the algorithm does not return a matrix of NaNs (which causes failures in the tune function), and instead returns a valid metric (perhaps Euclidean) along with a warning message indicating that the algorithm has not converged.
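
That planned behaviour could look roughly like the following (a sketch of the fallback pattern, not the eventual pyDML code):

import warnings
import numpy as np

def fallback_metric(L):
    # If the optimisation overflowed into NaN/inf, return the identity
    # (i.e. the Euclidean metric) with a warning instead of letting the
    # NaNs propagate into the tune function.
    if not np.all(np.isfinite(L)):
        warnings.warn("DML algorithm did not converge; "
                      "falling back to the Euclidean metric.")
        return np.eye(L.shape[1])
    return L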

jlsuarezdiaz avatar Aug 01 '20 16:08 jlsuarezdiaz

Thanks! I'll update to master and try again with a lower learning rate.

aadharna avatar Aug 02 '20 22:08 aadharna

Dear Author,

I also encounter a similar issue when running KLLDA. Here is my code:

from sklearn.datasets import load_iris
from dml import KLLDA

iris = load_iris()
X = iris['data']
y = iris['target']
lda = KLLDA(n_neighbors=3, gamma=0.1)
lda.fit(X, y)

lda.transform(X) yields:

array([[ 2.79046973e+01+0.j, -1.14410429e+02+0.j, -1.56547954e+00+0.j, -1.01787596e-01+0.j],
       [ 2.82405181e+01+0.j, -1.05418117e+02+0.j, -2.22091823e+00+0.j,  1.01015386e-01+0.j],
       [ 2.59799976e+01+0.j, -1.05051163e+02+0.j, -1.46086899e+00+0.j, -6.43348630e-02+0.j],

Can you help me with that? Thank you! Feng

feng-bao-ucsf avatar Oct 07 '20 03:10 feng-bao-ucsf

Hi, thank you for pointing this out. I think I have just fixed this in 1a043d54cbb21ace5488762ebbdc6d75584250ae. Let me know if there is any other problem.

jlsuarezdiaz avatar Oct 07 '20 11:10 jlsuarezdiaz

Thank you for the quick reply. I also have another question about KLLDA. In the function there is one key parameter, gamma, but it does not appear in the function _solve_sugiyama. Did you replace it with alpha?

feng-bao-ucsf avatar Oct 07 '20 16:10 feng-bao-ucsf

Hi, gamma is the kernel coefficient that is needed by some of the kernels, and it is handled in the superclass KernelDML_Algorithm, specifically in the function _get_kernel. This follows the same outline that scikit-learn uses in other kernel-based algorithms. It has nothing to do with alpha, which is a regularization parameter specific to LLDA and KLLDA.
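
As an illustration of that convention (a sketch using scikit-learn's pairwise_kernels, not KernelDML_Algorithm's internal code):

import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

X = np.random.default_rng(0).standard_normal((5, 3))

# gamma is consumed when the kernel matrix is built (here an RBF kernel);
# it is independent of any regularization parameter such as alpha.
K = pairwise_kernels(X, metric='rbf', gamma=0.1)
print(K.shape)  # (5, 5)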

jlsuarezdiaz avatar Oct 08 '20 10:10 jlsuarezdiaz