Add dimensionality reduction to AutoMLSearch
Investigating the performance of one of our perf test datasets in #2628 showed that the dataset has too many dimensions relative to its number of data points, and that performance suffers significantly because of it. We have dimensionality reduction components (both PCA and LDA), but right now we have no convenient way to add these to `AutoMLSearch`. I see two different ways we could make this easier when given high-dimensional datasets:
- Add a `HighDimensionalityDataCheck` that checks if the ratio of data points to number of features is too low, and an easy flag to add to search that would automatically include dimensionality reduction components in pipelines.
- With the addition of the new Default Algorithm and its new "long mode", try dimensionality reduction components in pipelines during long mode only. This keeps models easy to understand in fast mode while potentially improving long mode performance.
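As a rough illustration of the first option, here is a minimal sketch of the ratio check in plain Python. It does not use any evalml API: the function names, the `min_rows_per_feature` threshold of 10, and the `"imputer"` / `"pca"` component labels are all hypothetical placeholders, and the right threshold would need to be chosen empirically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HighDimensionalityResult:
    """Outcome of the (hypothetical) high-dimensionality check."""
    rows_per_feature: float
    is_high_dimensional: bool


def check_high_dimensionality(
    n_rows: int,
    n_features: int,
    min_rows_per_feature: float = 10.0,  # assumed threshold, not from evalml
) -> HighDimensionalityResult:
    """Flag a dataset whose ratio of data points to features is too low."""
    if n_rows < 0 or n_features <= 0:
        raise ValueError("n_rows must be >= 0 and n_features must be > 0")
    ratio = n_rows / n_features
    return HighDimensionalityResult(
        rows_per_feature=ratio,
        is_high_dimensional=ratio < min_rows_per_feature,
    )


def preprocessing_steps(result: HighDimensionalityResult) -> List[str]:
    """Return placeholder step labels, adding a reduction step when flagged."""
    steps = ["imputer"]  # placeholder label, not a real component name
    if result.is_high_dimensional:
        steps.append("pca")  # placeholder for a dimensionality reduction step
    return steps


# Usage: 100 rows with 50 features gives 2 rows per feature, which is flagged.
wide = check_high_dimensionality(n_rows=100, n_features=50)
tall = check_high_dimensionality(n_rows=1000, n_features=10)
print(preprocessing_steps(wide))
print(preprocessing_steps(tall))
```

A flag on search could then consult a result like this to decide whether to add a reduction component, rather than requiring users to build the pipeline by hand.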
@eccabay Thanks for submitting! We're going to ice this for now until product/customer demand catches up and we can prioritize this a little better. Great idea.
The motivation for this issue stems from this comment. Some datasets, including `restaurants.csv` from issue #2628, are almost doomed to fail thanks to the curse of dimensionality. Adding dimensionality reduction should significantly improve performance on these sorts of datasets.