BUG - Clarification of the behavior of DropCols
The behavior of DropCols is not consistent with what it says in the docstring.
The docstring states
The other columns are kept in their original order. A ValueError is raised if any of the provided column names are not in the dataframe.
But the actual behavior is different: if the column to drop is not found in the dataframe, then no exception is raised.
>>> from skrub import DropCols
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1,2,3,4], "b":[3,4,5,6]})
>>> drop = DropCols("a")
>>> df_drop = df.drop(columns="a")
>>> df_drop
b
0 3
1 4
2 5
3 6
>>> drop.fit_transform(df)
b
0 3
1 4
2 5
3 6
>>> drop.transform(df_drop)
b
0 3
1 4
2 5
3 6
Either the docstring must be fixed, or the behavior of the transformer must be adapted so that not finding the column name raises an exception. I think the current behavior of DropCols makes more sense, but I can also see some value in raising an error if the column is not found. Maybe detecting some problem upstream?
There are a few different approaches that we can take:
- We don't raise errors in either fit or transform if we try to drop a column that isn't found in the column names
- We try to enforce the same behavior in fit and transform, so that if a column is found at fit time, but isn't found at transform time, an exception is raise
- We raise an exception in fit if the column isn't found (this is closer to the default behavior of pandas drop https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop)
I think we should proceed with the first solution, as it is simpler to implement and explain in the docs. We might want to re-evaluate this in the future.
As a result, the current issue becomes one of improving the docstring of DropCols so that the current behavior is explained properly.
Hello @rcap107
Newbie here!
I ran the exemplary code above, and with a non-existent column d
As expected, both drop.fit(df) and drop.fit_transform(df) threw the same error: ValueError: The following columns are requested for selection but missing from dataframe: ['d']
On the other hand, the transform method does not throw an error if the column is not found. My question is, is it a bad idea given that either fit or fit_transform will be called first?
Also, will it suffice to specify in the documentation that the transform method drops the columns only if they exist, and otherwise ignores them?
Hello @rcap107
Newbie here!
I ran the exemplary code above, and with a non-existent column
dAs expected, bothdrop.fit(df)anddrop.fit_transform(df)threw the same error:ValueError: The following columns are requested for selection but missing from dataframe: ['d']On the other hand, the
transformmethod does not throw an error if the column is not found. My question is, is it a bad idea given that eitherfitorfit_transformwill be called first?Also, will it suffice to specify in the documentation that the
transformmethod drops the columns only if they exist, and otherwise ignores them?
Hello @Faith-Nchifor, thanks for checking out the issue.
There are two parts to the answer.
- In general, it's quite common (though it doesn't always happen) that if the transformer was fitted successfully, it will not run some of the validation checks at transform time. So, that's something that happens.
- In this case, the trasform of
DropColsis just selecting whatever columns were not dropped during the fit. This means that if it does not finddat transform time, nothing happens. However, if it cannot find one of the non-dropped columns, it will fail:
import pandas as pd
from skrub import DropCols
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4]})
d = DropCols("a")
d.fit_transform(df)
If we try to transform after dropping "a" nothing happens
d.transform(df.drop(columns="a"))
But dropping "b" will raise an exception
d.transform(df.drop(columns="b"))
I am not completely sure this is the behavior we want 🤔 What do you think @jeromedockes?
I think that in any case the transform behavior should be made explicit in the docstring, whatever is the behavior we decide to go for.
okay, I get it now.
okay, I get it now.
Hey @Faith-Nchifor, if you are considering working on this issue I would suggest picking a different one, because at the moment the direction we want to go in is not clear and I'd rather avoid wasting time on your side with something that we might need to rework in the end
It seems part of this inconsistent behaviour is down to DropCols running a select on everything except the columns to drop. This may cause further issues if a DropCols is run on a table that does not contain certain columns that it was originally fit on. I can take a look at that specific issue soon!
From IRL discussion:
The current behavior of DropCols is that an exception is raised if one of the columns that isn't supposed to be dropped is missing from the list (see https://github.com/skrub-data/skrub/issues/1787#issuecomment-3636017925). This is consistent with what scikit-learn does, that is raising an exception if the input columns at transform time are different from those at fit time.
We should keep this behavior, look at what scikit-learn does, and ensure consistency.
Additionally, there should be a custom warning that is raised in the case in which the column that is supposed to be dropped is not found in the list of columns, or when the selector that is passed to DropCols does not find anything. This warning should be something that users can easily disable and filter out if they do not need this kind of detail. I am not sure yet how to implement this.
Something else that should be fixed in the same PR is renaming the attributes of DropCols and SelectCols so they end with _, see https://github.com/scikit-learn/scikit-learn/issues/32910
Something else that should be fixed in the same PR is renaming the attributes of DropCols and SelectCols so they end with _, see https://github.com/scikit-learn/scikit-learn/issues/32910
private attribute names don't have to / shouldn't end in _ . if we decide to make them public after settling on the semantics in this issue then _kept_cols would be renamed to kept_cols_ and scikit-learn's default heuristic for detecting if an estimator is fitted by looking for any public attribute would work. otherwise an estimator that has no public attributes can tell scikit-learn that it is fitted by defining __sklearn_is_fitted__