evalml
Conditionally include the imputer in pipelines
Right now, we add the imputer to every pipeline whose input data includes types that the imputer supports. However, the imputer is only useful when there is missing data for it to impute. At first sight this doesn't seem to be a huge issue, but recent profiling has highlighted it as a performance problem. Consider this snakeviz profile, where the input data has no missing values:
Here, despite having no need for the imputer, the one component takes up a full quarter of the time it takes to call predict on the pipeline.
It is possible that there may be null values in the holdout data, which `make_pipeline` doesn't see, in which case we may want the imputer regardless. We should discuss this, but my initial thought is that it's too much of an edge case for us to take the performance hit for.
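To make the proposal concrete, here is a minimal sketch of the check, assuming the input is a pandas DataFrame. The helpers `needs_imputer` and `select_components` are hypothetical names made up for illustration, and the component names are plain strings rather than real evalml classes; this is not how `make_pipeline` is currently implemented.

```python
import pandas as pd


def needs_imputer(X: pd.DataFrame) -> bool:
    """Hypothetical helper: True if any cell in X is missing."""
    return bool(X.isnull().values.any())


def select_components(X: pd.DataFrame) -> list:
    """Hypothetical helper: build a component list, adding the imputer
    only when the data seen at pipeline-creation time contains nulls."""
    components = []
    if needs_imputer(X):
        components.append("Imputer")  # placeholder name, not an evalml class
    components.append("Estimator")  # placeholder for whatever estimator is chosen
    return components


# Illustrative data only.
X_clean = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
X_missing = pd.DataFrame({"a": [1.0, None], "b": ["x", "y"]})

print(select_components(X_clean))
print(select_components(X_missing))
```

Note that a check like this only looks at the data passed in when the pipeline is built, so it has exactly the holdout-data blind spot described above.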
Thanks for filing this...this is extremely interesting. Which block here is the imputer? The magenta one toward the middle?
@chukarsten yup, that bright magenta, next to the `estimator.py:80(predict)`. The details are on the side; annoyingly, the screenshot removed my mouse hovering over that section.