sklearn-pandas

Clarify relationship to ColumnTransformer

Open micahjsmith opened this issue 6 years ago • 7 comments

Scikit-learn v0.20.0 was just released and includes the ColumnTransformer, which is closely related to the DataFrameMapper. It might be helpful for people who come to the sklearn-pandas project to see a note in the README explaining how the two approaches differ and when one should be preferred over the other.

micahjsmith, Sep 26 '18 19:09

@micahjsmith Yes, agreed, it seems like ColumnTransformer does much of the same work in a similar way. It is definitely worth mentioning in the README file. Would you like to make a PR with documentation describing the similarities and differences between the two?

devforfu, Oct 07 '18 12:10

Sure, I can give it a shot

micahjsmith, Oct 07 '18 15:10

Just checking on this issue. Looks like sklearn now has ColumnTransformer. As such, I'm not sure if there are any additional benefits to using sklearn-pandas. Would someone mind clarifying?

ganesh-krishnan, Sep 09 '20 01:09

I never followed up above...

But maybe we can start collecting differences in this thread. ColumnTransformer is close to feature parity, and I presume the APIs may still change. The two are quite similar overall, with some minor differences.

API differences

| functionality | DataFrameMapper | ColumnTransformer |
|---|---|---|
| drop unmapped cols | default = False | remainder = 'drop' |
| drop specific cols | drop_cols = ['A', 'B'] | transformer = 'drop' |
| passthrough unmapped cols | default = None | remainder = 'passthrough' |
| passthrough specific cols | transformer = None | transformer = 'passthrough' |
| output dataframe | df_out = True | n/a |
| apply prefix and suffix | prefix and suffix options | n/a |
| apply default transformer | default = SomeTransformer() | n/a |
| global prefix and suffix | prefix and suffix kwargs | n/a |
| feature naming | user-specified or automatic | user-specified or use make_column_transformer |
| column selection | str or List[str] | str, array-like of str, int, array-like of int, array-like of bool, slice or callable |
| treatment of sparse data | only if sparse=True and has sparse output | by default, configurable by sparse_threshold |
| supervised transformations | yes | yes |
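
To make the ColumnTransformer column a bit more concrete, here is a minimal sketch of the drop and passthrough rows. It assumes scikit-learn >= 0.20 and pandas are installed; the toy frame and the step names (`scale_a`, `drop_b`, `keep_c`) are made up for illustration, and only the ColumnTransformer side is shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, purely illustrative.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, 20.0, 30.0],
    "C": [0.5, 0.25, 0.125],
    "D": [7.0, 8.0, 9.0],
})

# remainder='drop' (the default): unmapped columns B, C and D are discarded.
ct_drop = ColumnTransformer(
    [("scale_a", StandardScaler(), ["A"])],
    remainder="drop",
)
out_drop = ct_drop.fit_transform(df)

# Per-column 'drop' and 'passthrough', plus remainder='passthrough':
# A is scaled, B is dropped, C is kept as-is, and the unmapped column D
# is appended unchanged after the listed transformers.
ct_pass = ColumnTransformer(
    [
        ("scale_a", StandardScaler(), ["A"]),
        ("drop_b", "drop", ["B"]),
        ("keep_c", "passthrough", ["C"]),
    ],
    remainder="passthrough",
)
out_pass = ct_pass.fit_transform(df)

print(out_drop.shape)  # one output column: scaled A
print(out_pass.shape)  # three output columns: scaled A, C, D
```

With default settings the result is a NumPy array rather than a DataFrame, which matches the "output dataframe" row of the table.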

Other functionality

  • gen_features

Does this look about right? Are we missing anything?

micahjsmith, Sep 09 '20 14:09

@ganesh-krishnan does the above look about right?

micahjsmith, Sep 20 '20 14:09

Looks good per my understanding.

I'm not an expert in either by any means. I was struggling with which one to choose. A table like this should be very valuable to folks trying to make a decision.

ganesh-krishnan, Sep 21 '20 17:09

A few basic differences between DataFrameMapper() and ColumnTransformer(): https://github.com/arora123/Python-for-Data-Science/blob/master/DataFrameMapper_Vs_Column_Transformer.ipynb

arora123, Sep 23 '20 04:09