sklearn-pandas Clarify relationship to ColumnTransformer

Scikit-learn v0.20.0 was just released and includes the ColumnTransformer, which is closely related to the DataFrameMapper. It might be helpful for people who come to the sklearn-pandas project to see a note in the README explaining how the two approaches differ and when one should be preferred over the other.

Sep 26 '18 19:09 micahjsmith

@micahjsmith Yes, agreed, it seems like ColumnTransformer does a lot of the same work in a similar way. Definitely worth mentioning in the README file. Would you like to make a PR with documentation describing the similarities and differences between the two?

Oct 07 '18 12:10 devforfu

Sure, I can give it a shot

Oct 07 '18 15:10 micahjsmith

Just checking on this issue. Looks like sklearn now has ColumnTransformer. As such, I'm not sure if there are any additional benefits to using sklearn-pandas. Would someone mind clarifying?

Sep 09 '20 01:09 ganesh-krishnan

I never followed up above...

But maybe we can start collecting differences on this thread. ColumnTransformer is close to feature parity, and I presume the APIs may still change. The two are quite similar overall, with some minor differences.

API differences

| Functionality | DataFrameMapper | ColumnTransformer |
| --- | --- | --- |
| drop unmapped cols | `default = False` | `remainder = 'drop'` |
| drop specific cols | `drop_cols = ['A', 'B']` | `transformer = 'drop'` |
| passthrough unmapped cols | `default = None` | `remainder = 'passthrough'` |
| passthrough specific cols | `transformer = None` | `transformer = 'passthrough'` |
| output dataframe | `df_out = True` | n/a |
| apply prefix and suffix | `prefix` and `suffix` options | n/a |
| apply default transformer | `default = SomeTransformer()` | n/a |
| global prefix and suffix | `prefix` and `suffix` kwargs | n/a |
| feature naming | user-specified or automatic | user-specified or use `make_column_transformer` |
| column selection | `str` or `List[str]` | str, array-like of str, int, array-like of int, array-like of bool, slice or callable |
| treatment of sparse data | only if `sparse=True` and has sparse output | by default, configurable by `sparse_threshold` |
| supervised transformations | yes | yes |
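To make the drop and passthrough rows above concrete, here is a minimal sketch of the ColumnTransformer side only. The toy DataFrame and its column names (`age`, `income`, `id`, `note`) are made up for illustration, and the DataFrameMapper equivalents in the comments are taken from the table rather than executed here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1.0, 2.0, 3.0],
    "id": [101, 102, 103],
    "note": [7, 8, 9],
})

# Drop unmapped columns: remainder='drop'
# (per the table, DataFrameMapper: default=False).
ct_drop = ColumnTransformer(
    [("scale", StandardScaler(), ["age"])],
    remainder="drop",
)
out_drop = ct_drop.fit_transform(df)  # only the scaled 'age' column remains

# Pass through or drop specific columns via the transformer slot, and
# pass through everything unmapped via remainder='passthrough'
# (per the table, DataFrameMapper: transformer=None, drop_cols=[...], default=None).
ct_pass = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["age"]),
        ("keep", "passthrough", ["income"]),
        ("skip", "drop", ["id"]),
    ],
    remainder="passthrough",
)
out_pass = ct_pass.fit_transform(df)  # columns: scaled age, income, note
```

In this sketch the output columns follow the order of the transformer list, with the remainder column (`note`) appended last, and the result is a NumPy array rather than a DataFrame.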

Other functionality

gen_features

Does this look about right? Are we missing anything?

Sep 09 '20 14:09 micahjsmith

@ganesh-krishnan does the above look about right?

Sep 20 '20 14:09 micahjsmith

Looks good per my understanding.

I'm not an expert in either by any means. I was struggling to decide which one to choose. A table like this should be very valuable to folks trying to make a decision.

Sep 21 '20 17:09 ganesh-krishnan

A few basic differences between DataFrameMapper() and ColumnTransformer(): https://github.com/arora123/Python-for-Data-Science/blob/master/DataFrameMapper_Vs_Column_Transformer.ipynb

Sep 23 '20 04:09 arora123
