
FastICA: `n_components is too large: it will be set to _`

Open lsorber opened this issue 7 years ago • 4 comments

FastICA currently looks for n_components between 10 and 2000, with a default of 100. However, assuming there are more rows N than columns M, a data set can only support up to M independent components. The suggested range is therefore between 2 and M, with a default of round(M / 2).
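For reference, the warning in the title comes from scikit-learn itself whenever n_components exceeds min(N, M); a toy reproduction (shapes chosen here only to trigger the warning):

```python
# Reproduce the warning from the issue title with plain scikit-learn.
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.RandomState(0).randn(500, 20)  # N=500 rows, M=20 columns
ica = FastICA(n_components=100, random_state=0)  # larger than M
ica.fit(X)  # UserWarning: n_components is too large: it will be set to 20
```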

Looking at the implementation, it appears that information about the data set is not available in get_hyperparameter_search_space. As a workaround, the n_components property could be replaced by an n_components_rel property that specifies the number of independent components relative to the total number of features M. When calling sklearn.decomposition.FastICA, you could then simply pass n_components=X.shape[1] * self.n_components_rel.
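A minimal sketch of that workaround (the wrapper class and n_components_rel below are illustrative, not the existing auto-sklearn implementation):

```python
# Illustrative sketch only: resolve a relative component count at fit time.
from sklearn.decomposition import FastICA

class FastICAComponent:
    def __init__(self, n_components_rel=0.5, random_state=None):
        # Fraction of the feature count M to keep as independent components.
        self.n_components_rel = n_components_rel
        self.random_state = random_state

    def fit(self, X, y=None):
        # Resolve the fraction against the actual number of columns M.
        n_components = max(1, int(round(X.shape[1] * self.n_components_rel)))
        self.preprocessor = FastICA(
            n_components=n_components, random_state=self.random_state
        )
        self.preprocessor.fit(X)
        return self

    def transform(self, X):
        return self.preprocessor.transform(X)
```

This keeps every sampled value of n_components_rel valid regardless of how many columns the dataset has.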

lsorber avatar Feb 24 '17 12:02 lsorber

I'd like to work on this issue; let me know if it's still up for grabs.

PR idea:

  • I was thinking we could add a new key x_shape to the info property of the datamanager object. x_shape would be a tuple of array dimensions for X_train, similar to shape for NumPy arrays.
  • Add x_shape to dataset_properties.
  • In get_hyperparameter_search_space(), use dataset_properties to change n_components to n_components_rel (see the sketch below).
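If it helps, here's a rough sketch of the search-space side with ConfigSpace (n_components_rel, x_shape, and the bounds are placeholders from this thread, not existing auto-sklearn code):

```python
# Hypothetical sketch: search over a relative fraction instead of an
# absolute n_components, so every sampled value is valid for any M.
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

def get_hyperparameter_search_space(dataset_properties=None):
    cs = ConfigurationSpace()
    n_components_rel = UniformFloatHyperparameter(
        "n_components_rel", lower=0.1, upper=1.0, default_value=0.5
    )
    cs.add_hyperparameter(n_components_rel)
    # If dataset_properties carried an "x_shape" entry, M = x_shape[1] could
    # instead be used here to bound an absolute integer n_components by M.
    return cs
```

The component's fit() would then resolve the absolute count as int(round(M * n_components_rel)), as in the earlier sketch.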

duskybomb avatar Mar 30 '21 18:03 duskybomb

Is there even a possibility that an optimal N exists for dimensionality reduction?

BradKML avatar Feb 29 '24 12:02 BradKML

There is no optimal value; that's why it's a hyperparameter to be searched over :)

eddiebergman avatar Feb 29 '24 16:02 eddiebergman

@eddiebergman is there such a thing as an optimal N for dimensionality reduction, followed by an optimal K for clustering with the least error? https://www.kaggle.com/datasets/ramontanoeiro/big-five-personality-test-removed-nan-and-0 https://www.nature.com/articles/s41562-018-0419-z

BradKML avatar Mar 11 '24 04:03 BradKML