fix: replace columns with many nulls with missing indicators
Fixes
Fixes #1714
Description
This PR replaces columns with many nulls with missing indicators. Instead of dropping columns whose null fraction exceeds the drop_null_fraction threshold, they are replaced with missing indicator columns (float32: 1.0 for missing, 0.0 for present). This preserves the information of whether a value was present while avoiding spending feature dimensions on sparse columns.
- Modify DropUninformative to return missing indicators for columns with many nulls (when drop_null_fraction < 1.0)
- Entirely null columns are still dropped (as the indicator would be constant)
- Default behavior unchanged (drop_null_fraction=1.0 only drops entirely null columns)
- Update documentation for DropUninformative, Cleaner, and TableVectorizer
- Add tests for missing indicator replacement
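For illustration, a minimal pandas sketch of the replacement described above (illustrative column names and threshold, not the actual skrub implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 4.0],  # 75% nulls
    "dense": [1.0, 2.0, 3.0, 4.0],
})

# With drop_null_fraction=0.5, "mostly_null" exceeds the threshold.
# Instead of dropping it, the column is replaced by a float32 indicator:
# 1.0 where the value was missing, 0.0 where it was present.
indicator = df["mostly_null"].isna().astype("float32")
print(indicator.tolist())  # [1.0, 1.0, 1.0, 0.0]
```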
Hi @honghanhh, thanks for the PR. I think it's already in good shape, but I am not fully convinced by this implementation. I think it would have been better to discuss the details of the implementation in the original issue, rather than opening the PR directly.
I don't think we should replace the "drop column" functionality entirely, because very often if the columns in my dataframe are mostly nulls I might just want to get rid of them, as I don't care about their content at all.
I would rather allow the user to choose between dropping the columns (selecting the fraction), and replacing null values with the indicator. Something like:
DropUninformative(drop_if_null_fraction="replace")
as an additional parameter besides the fraction and None.
In general, is there a reason for using a threshold to decide which columns should use the missingness indicator, rather than using it everywhere? I am not familiar with its usage, so I don't know how it's normally done. Paging @jeromedockes as well.
@rcap107 I see your point and agree. The current implementation doesn't let you drop columns when drop_null_fraction < 1.0. In this case, it makes sense to have an option so the user can decide to either:
- Drop entirely null columns
- Replace partially null columns with indicators (instead of dropping them)
I would propose adding a strategy parameter:
drop_null_fraction: float or None, default=1.0
strategy: {"drop", "replace"}, default="drop"
where:
- "drop": drop columns above the threshold (or entirely null)
- "replace": replace columns above the threshold with missing indicators
What do you think? Let's move this discussion to #1714 and finalize the solution there; then I would love to update this PR.
In general, is there a reason for using a threshold to decide which columns should use the missingness indicator, rather than using it everywhere? I am not familiar with its usage, so I don't know how it's normally done. Paging @jeromedockes as well
for columns we keep, typically either
- we don't impute missing values and there is no need for the indicator -- the nulls are preserved in the original column. example: when using the hist gradient boosting
- or we impute missing values and add the indicator (for example with scikit-learn's SimpleImputer which has an option for adding the indicator)
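For reference, a minimal example of the scikit-learn option mentioned in the second bullet (existing sklearn behavior, not skrub code):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
imputer = SimpleImputer(strategy="mean", add_indicator=True)
# The output contains the imputed feature columns, followed by a binary
# indicator column for each feature that had missing values during fit.
print(imputer.fit_transform(X))
```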
but here the issue is about the behavior of DropUninformative, which allows dropping the column altogether if it contains many nulls. that option can be problematic because it is quite frequent that a column is mostly empty, but the rare event that a value is present is actually very informative. so in that case, I think a good tradeoff can be keeping just the missingness mask. it preserves the information of whether the value was there or not, without wasting many dimensions on encoding values for which we have too few examples to use them effectively, or on a column that might not be useful for prediction. in the TableVectorizer, for example, I think it might be a 'safer' default behavior
I haven't thought it through any further than that
Hi @honghanhh, we can continue talking here, no problem (and I apologize if my previous comment sounded rude).
About the strategy, I think we should distinguish between using DropUninformative as a cleaning tool, and as an encoding tool.
In the first case (more likely to be encountered by the Cleaner), a user might want to use DropUninformative to remove the columns directly: the transformer becomes a convenience function for removing columns with a lot of null values.
However, this might not be the best solution if we actually want to encode the column (more of a TableVectorizer use case), because then the missing values become informative on their own, and for that we should use the missingness indicator. Then again, wouldn't replacing the data with the missingness mask remove information anyway?
I am wondering if we should have a transformer that is separate from DropUninformative, and that handles encoding columns with missing indicators. In this way, we could keep the functionality of DropUninformative clear and limited to "dropping columns we consider uninformative" according to various requirements, and have a new transformer (e.g., MissingnessEncoder, name tbd) that deals with missing values. Then, the strategy used by the TableVectorizer could be different from what is used in the Cleaner, to better define each transformer's role.
We could also extend the current has_nulls() selector so that it lets the user specify the null fraction. Then, it would become possible to do something like:
ApplyToCols(MissingnessEncoder(), cols=s.has_nulls(fraction=0.5))
This clearly depends on the interface we're going with, and modifying the selector should be done in a separate PR.
What do you think? @honghanhh @jeromedockes
I agree with your point about the distinction between cleaning / removing poorly-filled columns and encoding / finding a good representation for columns with many nulls. In a way, that reflects the two reasons a field might be NULL: the information is missing due to poor data quality, or it is a true NULL / the field does not apply (e.g. "pool surface" for a house that doesn't have a pool).
maybe, as you say, the best option is to have a selector for this null proportion threshold and keep the simple "drop" option in the Cleaner...
note there is no need for a new MissingnessEncoder, as this functionality already exists in scikit-learn as sklearn.impute.MissingIndicator
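For reference, a minimal example of that existing transformer:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[np.nan, 1.0], [2.0, np.nan], [3.0, 4.0]])
# Boolean mask with one column per feature that had missing values
# during fit (features="missing-only" by default).
print(MissingIndicator().fit_transform(X))
```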