Question Regarding Pipeline

Open supradl opened this issue 3 years ago • 2 comments

Dear Uncle Steve,

Thank you so much for your tutorial on Pipelines. I've learned a lot from it.

However, I do have a question regarding feature importance.

In your code, you're able to get the feature importances for the pipeline. However, because of the ColumnTransformer, the column names are dropped and replaced by numbers, so I can't tell which columns are important (the columns are all numbers instead of the original names).

How do we overcome this problem?

Thank you!

Daniel

supradl, Dec 09 '21 19:12

Hi Daniel,

Yes, this is an ongoing challenge with sklearn pipelines, with no easy, general answer. There are a few bad options:

  • Don't investigate feature importance
  • Manually figure out what each column is, e.g., by using your understanding of what each step in the Pipeline does
  • In some cases, depending on the transformers you used, you can call get_feature_names() and it just works
  • Wrap each transformer with your own custom transformers that maintain DataFrames (and don't convert to numpy)

Sorry that I could not be of more help. sklearn pipelines are just currently limited by this.

stepthom, Dec 10 '21 13:12

Dear Uncle Steve,

Thank you for your reply.

I'll definitely take your advice in my future projects. When I was working on my project, I did come across the suggestion to use the get_feature_names() method in various Stack Overflow posts. However, it was extremely hard for me to implement because I always got an error along the lines of "xxx does not have this attribute/method 'get_feature_importances()' ". I also tried to research how to use it but did not end up with the results I was looking for. Could you please provide more insight into this method? Regarding "depending on the transformers you used", I guess ColumnTransformer isn't one of the transformers that has this method? Are there any other transformers I could use in lieu of ColumnTransformer that would allow this method?

Lastly, if nothing works and I need to know the column names, would you recommend doing the transformation outside of my Pipeline (I think this is similar to the last option you gave above)? I guess with this approach I'll have to make sure my transformations do not lead to data leakage. Then, after the appropriate transformation, I can set up my Pipeline with the remaining steps that do not involve ColumnTransformer (i.e., scaling, oversampling, etc.).
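
For reference, the "wrap each transformer" option can be sketched roughly as follows. DataFrameWrapper is a hypothetical helper, not part of scikit-learn or the tutorial, and the data and column names are invented. It assumes the wrapped transformer returns the same number of columns it receives (scalers and imputers do; one-hot encoders do not). Because the wrapper stays inside the Pipeline, it is fitted only on the training data passed to fit, which avoids the leakage concern.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DataFrameWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: apply a transformer to some columns, return a DataFrame."""

    def __init__(self, transformer, columns):
        self.transformer = transformer
        self.columns = columns

    def fit(self, X, y=None):
        # Fit a fresh copy of the wrapped transformer on the chosen columns only.
        self.transformer_ = clone(self.transformer).fit(X[self.columns], y)
        return self

    def transform(self, X):
        # Copy so the caller's DataFrame is left untouched, then overwrite
        # the chosen columns with their transformed values.
        X = X.copy()
        X[self.columns] = self.transformer_.transform(X[self.columns])
        return X


# Toy data; the column names are hypothetical.
X = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0, 38.0, 29.0],
    "income": [40.0, 52.0, 88.0, 61.0, 75.0, 45.0],
    "tenure": [1.0, 3.0, 10.0, 12.0, 7.0, 2.0],
})
y = [0, 0, 1, 1, 1, 0]

pipe = Pipeline([
    ("scale", DataFrameWrapper(StandardScaler(), ["age", "income"])),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X, y)

# The preprocessing steps still produce a DataFrame, so the names survive.
X_t = pipe[:-1].transform(X)
importances = pd.Series(pipe[-1].feature_importances_, index=X_t.columns)
print(importances.sort_values(ascending=False))
```

The trade-off is that each kind of transformer (for example, one that changes the number of columns) needs its own handling for the output names, which is why this is listed among the "bad options".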

Sorry for writing a long and rather confusing post. I really appreciate you taking the time to address my questions.

Sincerely,

Daniel

supradl, Dec 12 '21 19:12