Implemented adaptive squashing
WIP
Here is an implementation of the robust scaling + smooth clipping from my paper (https://arxiv.org/abs/2407.04491). It could be used as a robust numerical preprocessing method for neural-network-based models.
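To give a rough idea of what the transformation does, here is a minimal NumPy sketch: a robust (quantile-based) scaling followed by a smooth clipping. The quantile choices, the centering on the median, and the exact soft-clip formula below are illustrative placeholders, not necessarily what the class or the paper uses:

```python
import numpy as np

def squash_column(x, lower_quantile=0.25, upper_quantile=0.75, max_abs=3.0):
    """Illustrative robust scaling + smooth clipping of a 1-D array (NaNs preserved)."""
    x = np.asarray(x, dtype=float)
    q_low, q_high = np.nanquantile(x, [lower_quantile, upper_quantile])
    scale = q_high - q_low
    if scale == 0.0:  # constant (or near-constant) column: avoid division by zero
        scale = 1.0
    z = (x - np.nanmedian(x)) / scale
    # smooth clipping: roughly the identity near 0, saturates around +/- max_abs
    return z / np.sqrt(1.0 + (z / max_abs) ** 2)
```

The point is that outliers end up in a bounded range without a hard threshold, which tends to be friendlier to gradient-based models than a hard clip.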
Some considerations:
- Which name do we want to use for the class? Currently it is AdaptiveSquashingTransformer.
- I did not check the rendering of the documentation. Do we need to link the class from somewhere in the documentation?
- If the column passed to transform() has a different name than the one passed to fit_transform(), the output column still gets the name of the column from fit_transform(). I copied this behavior from StringEncoder; is that OK? (See also the todo in the corresponding tests.)
Also, I think having a smallish example that shows when and where to use your class would be great. WDYT?
Sounds good, where would you put it?
I have now addressed everything except for the example.
I removed the test that checked whether a column named None still has the name None after transformation, since it failed for polars. I assume this is not relevant? (The test is no longer necessary to reach 100% coverage.)
> Sounds good, where would you put it?
@dholzmueller, good question, I'm not sure. What would be the best way to illustrate its usefulness? Do you have a small benchmark or experiment that could fit into an example? Otherwise we might need to be creative
We could try it on a dataset with an MLP and see if it improves something (compared to StandardScaler? or QuantileTransformer?). I don't know which dataset would be the best one for this.
In what setup (dataset + model) have you observed a sharp improvement in your experiments? Otherwise, we could try it on any skrub dataset with an MLP, as you suggest.
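To make this concrete, here is a rough sketch of the kind of comparison we could run. The dataset is just a numeric-only placeholder and the MLP settings are arbitrary; the new transformer would be swapped in as a third candidate next to the two baselines:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# swap in the new transformer here to compare it against the baselines
for scaler in [StandardScaler(), QuantileTransformer(output_distribution="normal")]:
    pipe = make_pipeline(scaler, MLPRegressor(max_iter=500, random_state=0))
    scores = cross_val_score(pipe, X, y, cv=3, scoring="r2")
    print(type(scaler).__name__, scores.mean())
```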
Updates:
- renamed the class to SquashingScaler
- made it work for dataframes and not only series by using OnEachColumn with the old class
- renamed the args to lower_quantile and upper_quantile
- incorporated minor suggestions
- added an example using the SquashingScaler
@GaelVaroquaux was suggesting to implement SquashingScaler directly instead of with the indirection through SingleColumnSquashingScaler. I'll need to check how easy that is...
> @GaelVaroquaux was suggesting to implement SquashingScaler directly instead of with the indirection through SingleColumnSquashingScaler. I'll need to check how easy that is...
that's an option too, I'll let you decide!
The downside is that we might end up duplicating some of OnEachColumn's logic: looping over columns, keeping track of each column's transformation, and reconstructing a dataframe with the right column names regardless of the scikit-learn set_output setting. So it seems a bit easier to rely on OnEachColumn for that. However, here the transformer is fast enough that we don't need parallelization, we don't have to worry about duplicate column names because we have a 1-to-1 mapping, and we apply it to all columns, so maybe it is simple enough that we can do without OnEachColumn.
In any case, I think we want to avoid allocating one big numpy array into which we copy all the dataframe's columns (as is done in scikit-learn with check_array).
> that's an option too, I'll let you decide!
Users (and non-advanced programmers) struggle with indirections.
Your point, Jerome, is about memory overhead? Do you think that 1) it will really be a problem, and 2) it will end up being avoided anyway, because the output of this transformer is likely going to be fed to a scikit-learn learner?
> Your point, Jerome, is about memory overhead?
Yes, because we go from columnar format, to one contiguous array, then back to columnar (dataframe) format. That is the case even if the output dataframe is then passed to an estimator.
> 1) it will really be a problem
Not necessarily, but when it is easy, it is always nice to avoid an extra copy, especially when it requires a much larger contiguous buffer than any single column in the input dataframe.
The main reason I mention it is that the fitting of each column, as currently implemented, uses Python control flow anyway (to handle corner cases such as an all-empty column, equal quantiles, or a constant column), so I suspect it will not be easy to express in a "vectorized" way on one numpy array. So if we end up with a loop over columns anyway, let's avoid building the full array.
If it makes things more complicated, you can ignore my comment.
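To illustrate what "looping over columns without building one big array" could look like, here is a very rough sketch. It is pandas-only for brevity, uses placeholder quantiles and soft-clip constants, and skips the corner cases mentioned above; the real implementation would go through skrub's dataframe-agnostic helpers:

```python
import numpy as np
import pandas as pd


class ColumnwiseSquasher:
    """Illustrative per-column fit/transform that never builds one big 2-D array."""

    def fit(self, df):
        self.params_ = {}
        for name in df.columns:
            col = df[name].to_numpy(dtype=float)  # one column at a time
            q_low, q_high = np.nanquantile(col, [0.25, 0.75])
            scale = q_high - q_low
            # constant column: fall back to a scale of 1 to avoid division by zero
            self.params_[name] = (np.nanmedian(col), scale if scale > 0 else 1.0)
        return self

    def transform(self, df):
        out = {}
        for name in df.columns:
            center, scale = self.params_[name]
            z = (df[name].to_numpy(dtype=float) - center) / scale
            out[name] = z / np.sqrt(1.0 + (z / 3.0) ** 2)  # smooth clipping
        return pd.DataFrame(out, index=df.index)
```

The output stays columnar the whole way through, so no intermediate buffer larger than a single column is needed.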
Conclusion of the last skrub meeting: this estimator should not use OnEachColumn, because it adds an extra layer of abstraction on top of the base object. Therefore, we have to implement the necessary logic in this transformer directly.
Update: this is now ready for review. This transformer behaves like any scikit-learn preprocessor, taking a pandas or polars dataframe or numpy array as input, and yielding a numpy array. This can be used within the TableVectorizer or ApplyOnCols without additional logic.
- I think the old behavior was to produce the original datatype (dataframe or numpy array), but I don't know what we actually want.
- Do we want to support applying the SquashingScaler to pd.Series? I think it could be convenient in some cases.
> I think it could be convenient in some cases.
Yes, totally. I was trying to minimize the differences between sklearn and skrub scalers APIs, but we definitely can.
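For illustration, here is a minimal sketch of the TableVectorizer usage mentioned above. The `numeric` parameter and the employee salaries dataset are assumptions made for the sake of the example, not something this PR adds:

```python
import skrub
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()

# squash every numeric column, keep skrub's defaults for the other column types
# (the `numeric` parameter of TableVectorizer is assumed here)
vectorizer = skrub.TableVectorizer(numeric=skrub.SquashingScaler())
X = vectorizer.fit_transform(dataset.X)
```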
Hey @rcap107 thanks for reviewing. I iterated on the documentation; let me know what you think
> Hey @rcap107 thanks for reviewing. I iterated on the documentation; let me know what you think
It's much better, thanks a lot! I fixed a small typo, but aside from that I think we can merge this
Great! Waiting for other people to give their feedback
Interface looks good to me! Did you test what happens if you pass in an all-nan column?
I'm waiting for @Vincent-Maladiere's reply to @dholzmueller's comment above, and then I'll merge.
> Interface looks good to me! Did you test what happens if you pass in an all-nan column?
Good question. We get a column full of nans, since they are left untouched. Is that what you would expect?
Reproducer:

```python
import numpy as np
import pandas as pd

import skrub

skrub.SquashingScaler().fit_transform(
    pd.DataFrame(dict(a=[np.nan, np.nan]))
)
```
Yes, that is what I would expect. Just wasn't sure if it's going to work with the nanmax/nanmin operations.
Just a thought, having quantile_range as a tuple instead of two separate values might make it harder for people to tune it in some hyperparameter optimization. I think it's probably not an important hyperparameter to tune, though.
> Just a thought, having quantile_range as a tuple instead of two separate values might make it harder for people to tune it in some hyperparameter optimization. I think it's probably not an important hyperparameter to tune, though.
This can be addressed in another PR
> This can be addressed in another PR
You mean, another one for the same release? Because it's not as convenient to change the parameters after release, right?
>> This can be addressed in another PR
> You mean, another one for the same release? Because it's not as convenient to change the parameters after release, right?
It's not a very big deal, and it should not be a big change, up to you @Vincent-Maladiere
(I'd rather have this PR in the release and open a new one later than risk not having it)
That's a good point, both sklearn's RobustScaler and MinMaxScaler have a tuple range as a hyperparameter, so I think this is fine.
okay, then let's leave it like it is.
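For the record, a tuple parameter can still be searched directly by listing candidate tuples in the grid, so tuning stays possible. A small sketch (the pipeline step name, the parameter name, and whether the bounds are fractions or RobustScaler-style percentiles are assumptions to double-check):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

import skrub

pipe = make_pipeline(skrub.SquashingScaler(), MLPRegressor(max_iter=500))
param_grid = {
    # each candidate is a whole (lower, upper) pair; percentile units assumed
    "squashingscaler__quantile_range": [(25.0, 75.0), (10.0, 90.0), (5.0, 95.0)],
}
search = GridSearchCV(pipe, param_grid, cv=3)
```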
OK, I suggest that we merge and worry about the NaN behavior in a follow-up.