evalml Add dimensionality reduction to AutoMLSearch

Open eccabay opened this issue 3 years ago • 2 comments

Exploring the performance of one of our perf test datasets in #2628 showed that the dataset has too many dimensions relative to its number of data points, and performance suffers significantly because of it. We have dimensionality reduction components (both PCA and LDA), but right now there is no convenient way to add them to AutoMLSearch. I see two ways we could make this easier for high-dimensional datasets:

1. Add a HighDimensionalityDataCheck that checks whether the ratio of data points to features is too low, plus an easy flag on search that automatically includes dimensionality reduction components in pipelines.
2. With the new Default Algorithm and its "long mode", try dimensionality reduction components in pipelines during long mode only. This keeps models easy to understand in fast mode while potentially improving long mode performance.
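
To make the first option concrete, here is a rough sketch of the kind of ratio test such a check might run. This is not evalml's data check API: the function name, the result class, and the default threshold of 10 rows per feature are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HighDimensionalityWarning:
    """Hypothetical result object; not part of evalml."""
    message: str
    n_rows: int
    n_features: int
    rows_per_feature: float


def check_high_dimensionality(
    n_rows: int,
    n_features: int,
    min_rows_per_feature: float = 10.0,  # assumed threshold, not from the issue
) -> Optional[HighDimensionalityWarning]:
    """Return a warning when there are too few rows per feature, else None."""
    if n_rows < 0 or n_features < 0:
        raise ValueError("n_rows and n_features must be non-negative")
    if n_features == 0:
        # No features means there is nothing to reduce.
        return None
    rows_per_feature = n_rows / n_features
    if rows_per_feature < min_rows_per_feature:
        return HighDimensionalityWarning(
            message=(
                f"Only {rows_per_feature:.2f} rows per feature "
                f"(threshold {min_rows_per_feature}); "
                "consider adding PCA or LDA to the pipelines."
            ),
            n_rows=n_rows,
            n_features=n_features,
            rows_per_feature=rows_per_feature,
        )
    return None


# Example: 100 rows and 50 features gives 2 rows per feature, which is flagged.
result = check_high_dimensionality(n_rows=100, n_features=50)
if result is not None:
    print(result.message)
```

A search flag could then consult a result like this to decide whether to include the dimensionality reduction components. The right threshold would need to be tuned against the perf test datasets.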

Sep 07 '21 12:09 eccabay

@eccabay Thanks for submitting! We're going to ice this for now until product/customer demand catches up and we can prioritize this a little better. Great idea.

Sep 08 '21 19:09 chukarsten

The motivation for this issue stems from this comment. Some datasets, including restaurants.csv from issue #2628, are almost doomed to fail thanks to the curse of dimensionality. Adding dimensionality reduction should significantly improve performance on these sorts of datasets.

Apr 14 '22 14:04 eccabay
