sklearn-pandas
Clarify relationship to ColumnTransformer
Scikit-learn v0.20.0 was just released and includes the ColumnTransformer, which is closely related to the DataFrameMapper. It might be helpful for people who come to the sklearn-pandas project to see a note in the README explaining how the two approaches differ and when one should be preferred over the other.
@micahjsmith Yes, agreed, it seems like ColumnTransformer does a lot of its work in a similar way. Definitely worth mentioning in the README file. Would you like to make a PR with documentation describing the similarities and differences between the two?
Sure, I can give it a shot
Just checking on this issue. Looks like sklearn now has ColumnTransformer. As such, I'm not sure if there are any additional benefits to using sklearn-pandas. Would someone mind clarifying?
I never followed up above...
But maybe we can start collecting differences on this thread. ColumnTransformer is close to feature parity, and I presume the APIs may still change. The two are quite similar overall, with some minor differences.
API differences
functionality | DataFrameMapper | ColumnTransformer |
---|---|---|
drop unmapped cols | default = False | remainder = 'drop' |
drop specific cols | drop_cols = ['A', 'B'] | transformer = 'drop' |
passthrough unmapped cols | default = None | remainder = 'passthrough' |
passthrough specific cols | transformer = None | transformer = 'passthrough' |
output dataframe | df_out = True | n/a |
apply prefix and suffix | prefix and suffix options | n/a |
apply default transformer | default = SomeTransformer() | n/a |
global prefix and suffix | prefix and suffix kwargs | n/a |
feature naming | user-specified or automatic | user-specified or use make_column_transformer |
column selection | str or List[str] | str, array-like of str, int, array-like of int, array-like of bool, slice or callable |
treatment of sparse data | only if sparse=True and has sparse output | by default, configurable by sparse_threshold |
supervised transformations | yes | yes |
Other functionality
- gen_features
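To make the ColumnTransformer column of the table a bit more concrete, here is a minimal sketch of the `remainder` and per-column `'drop'` / `'passthrough'` options. The toy DataFrame and its column names (`age`, `income`, `id`) are made up for illustration, and the DataFrameMapper equivalents appear only as comments taken from the table, so treat this as a rough illustration rather than a reference.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1.0, 2.0, 3.0],
    "id": [101, 102, 103],
})

# Drop unmapped columns (DataFrameMapper counterpart per the table: default=False).
ct_drop = ColumnTransformer(
    [("scale_age", StandardScaler(), ["age"])],
    remainder="drop",
)
out_drop = ct_drop.fit_transform(df)  # only the scaled "age" column

# Pass unmapped columns through (DataFrameMapper counterpart: default=None).
ct_pass = ColumnTransformer(
    [("scale_age", StandardScaler(), ["age"])],
    remainder="passthrough",
)
out_pass = ct_pass.fit_transform(df)  # scaled "age", then "income" and "id"

# Drop or pass through specific columns by using the strings
# 'drop' and 'passthrough' in place of a transformer.
ct_specific = ColumnTransformer(
    [
        ("scale_age", StandardScaler(), ["age"]),
        ("keep_income", "passthrough", ["income"]),
        ("drop_id", "drop", ["id"]),
    ]
)
out_specific = ct_specific.fit_transform(df)  # scaled "age" and raw "income"

print(out_drop.shape, out_pass.shape, out_specific.shape)
```

Each result here is a plain NumPy array, which matches the "output dataframe" row above: in the version discussed in this thread there is no ColumnTransformer counterpart to `df_out = True`.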
Does this look about right? Are we missing anything?
@ganesh-krishnan does the above look about right?
Looks good per my understanding.
I'm not an expert in either by any means. I was struggling to decide which one to choose. A table like this should be very valuable to folks trying to make a decision.
A few basic differences between DataFrameMapper() and ColumnTransformer(): https://github.com/arora123/Python-for-Data-Science/blob/master/DataFrameMapper_Vs_Column_Transformer.ipynb