Conditionally include the imputer in pipelines

Open eccabay opened this issue 2 years ago • 2 comments

Right now, we add the imputer to every pipeline whose data includes datatypes that the imputer supports. However, the imputer is only useful when there is missing data for the component to impute. At first glance this doesn't seem to be a huge issue, but recent profiling has highlighted it as a performance problem. Consider this snakeviz profile, where the input data has no missing values:

[Screenshot: snakeviz profile, "Screen Shot 2022-07-07 at 11 39 25 AM"]

Here, despite there being no need for the imputer, that one component takes up a full quarter of the time it takes to call predict on the pipeline.

It is possible that there may be null values in the holdout data, which make_pipeline doesn't see, in which case we may want the imputer regardless. We should discuss this, but my initial thought is that it's too much of an edge case for us to take the performance hit for.
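To make the proposal and the edge case concrete, here is a rough sketch in plain Python. It is not evalml's `make_pipeline`: `has_missing` and `choose_components` are hypothetical helpers, and the component names are placeholder strings. It only illustrates adding the imputer when the data seen at pipeline-build time contains nulls, and how holdout data with nulls would slip past that check.

```python
import math


def has_missing(columns):
    """Return True if any value in any column is None or NaN."""
    for values in columns.values():
        for value in values:
            if value is None:
                return True
            if isinstance(value, float) and math.isnan(value):
                return True
    return False


def choose_components(columns, estimator="Estimator"):
    """Hypothetical stand-in for component selection (not evalml's API).

    The imputer is included only when the data passed in has missing values.
    """
    components = []
    if has_missing(columns):
        components.append("Imputer")
    components.append(estimator)
    return components


# Training data with no missing values: the imputer is skipped.
train = {"age": [31.0, 45.0, 27.0], "city": ["Oslo", "Lima", "Pune"]}

# Holdout data that the build step never sees, containing nulls.
holdout = {"age": [52.0, float("nan")], "city": ["Kyiv", None]}

train_components = choose_components(train)
holdout_needs_imputer = has_missing(holdout)

print(train_components)       # ['Estimator']
print(holdout_needs_imputer)  # True
```

In this sketch the pipeline built from `train` has no imputer even though `holdout` contains nulls, which is exactly the case we'd need to decide how to handle.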

eccabay avatar Jul 07 '22 15:07 eccabay

Thanks for filing this. This is extremely interesting. Which block here is the imputer? The magenta one toward the middle?

chukarsten avatar Jul 08 '22 16:07 chukarsten

@chukarsten yup, that bright magenta one, next to the estimator.py:80(predict). The details are on the side; annoyingly, the screenshot didn't capture my mouse hovering over that section.

eccabay avatar Jul 08 '22 17:07 eccabay