auto-sklearn
FastICA: `n_components is too large: it will be set to _`
FastICA currently searches for `n_components` between 10 and 2000, with a default of 100. However, assuming there are more rows N than columns M, a data set can only support up to M independent components. The suggested range is therefore between 2 and M, with a default of round(M / 2).
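For illustration, here is a minimal, self-contained reproduction of the behaviour in the issue title: when `n_components` exceeds the number of features, scikit-learn caps it and emits the "n_components is too large: it will be set to _" warning.

```python
import numpy as np
from sklearn.decomposition import FastICA

# N = 50 rows, M = 5 columns: the data supports at most 5 components.
X = np.random.RandomState(0).rand(50, 5)

# Asking for 100 components triggers the warning
# "n_components is too large: it will be set to 5".
ica = FastICA(n_components=100, random_state=0)
S = ica.fit_transform(X)
print(S.shape)  # (50, 5) -- capped at M
```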
Looking at the implementation, it appears that information about the data set is not available in `get_hyperparameter_search_space`. As a workaround, the `n_components` property could be replaced by an `n_components_rel` property that specifies the number of independent components relative to the total number of features M. When calling `sklearn.decomposition.FastICA`, you could then simply pass `n_components=int(X.shape[1] * self.n_components_rel)`.
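A minimal sketch of that workaround, assuming a hypothetical component class (the class and the `n_components_rel` parameter are illustrative, not auto-sklearn's existing API):

```python
from sklearn.decomposition import FastICA

class FastICAComponent:
    """Illustrative preprocessor that resolves a relative component
    count into an absolute one once the data is available."""

    def __init__(self, n_components_rel: float, random_state=None):
        # Fraction of the M features to keep, e.g. searched over (0, 1].
        self.n_components_rel = n_components_rel
        self.random_state = random_state

    def fit(self, X, y=None):
        # M = X.shape[1] is only known here, not when the search space
        # was built -- hence the relative encoding.
        n_components = max(1, int(X.shape[1] * self.n_components_rel))
        self.preprocessor = FastICA(
            n_components=n_components, random_state=self.random_state
        )
        self.preprocessor.fit(X)
        return self

    def transform(self, X):
        return self.preprocessor.transform(X)
```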
I want to work on this issue; let me know if it's still up for grabs.
PR idea:

- I was thinking we could add a new key `x_shape` to the `info` property of the `datamanager` object. `x_shape` would be a tuple of array dimensions for `X_train`, similar to `shape` for NumPy arrays.
- Add `x_shape` to `dataset_properties`.
- In `get_hyperparameter_search_space()`, use `dataset_properties` to change `n_components` to `n_components_rel` (see the sketch after this list).
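A rough sketch of what that last step could look like, assuming `dataset_properties` carries the proposed `x_shape` key (the key itself is hypothetical; the fallback bounds mirror the current hard-coded range):

```python
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformIntegerHyperparameter

def get_hyperparameter_search_space(dataset_properties=None):
    cs = ConfigurationSpace()
    if dataset_properties is not None and "x_shape" in dataset_properties:
        m = dataset_properties["x_shape"][1]
        # Suggested range: 2 .. M with default round(M / 2); data sets
        # with M <= 2 features would need special-casing not shown here.
        lower, upper, default = 2, m, max(2, round(m / 2))
    else:
        # Fall back to the current hard-coded bounds.
        lower, upper, default = 10, 2000, 100
    cs.add_hyperparameter(
        UniformIntegerHyperparameter(
            "n_components", lower, upper, default_value=default
        )
    )
    return cs
```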
Is there even a possibility that an optimal N exists for dimensionality reduction?
There is no optimum; that's why it's a hyperparameter to be searched over :)
@eddiebergman is there such a thing as an optimal N for dimensionality reduction, and then an optimal K for clustering with the least error?
https://www.kaggle.com/datasets/ramontanoeiro/big-five-personality-test-removed-nan-and-0
https://www.nature.com/articles/s41562-018-0419-z