
Implemented adaptive squashing

Open dholzmueller opened this issue 8 months ago • 13 comments

WIP

dholzmueller avatar Apr 11 '25 14:04 dholzmueller

Here is an implementation of the robust scaling + smooth clipping from my paper (https://arxiv.org/abs/2407.04491). It could be used as a robust numerical preprocessing method for neural-network-based methods.
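For reference, here is a minimal sketch of the idea, assuming the two-step scheme from the paper (robust scaling to a quantile range, then smooth clipping); the exact formula, parameter names, and defaults below are illustrative and may differ from the paper and from the class in this PR:

import numpy as np

def squash_column(x, lower_quantile=0.25, upper_quantile=0.75, max_abs=3.0):
    # Illustrative sketch, not the actual implementation: center on the median,
    # scale by the inter-quantile range, then squash large values smoothly so
    # the output stays roughly within [-max_abs, max_abs].
    x = np.asarray(x, dtype=float)
    median = np.nanmedian(x)
    q_low, q_high = np.nanquantile(x, [lower_quantile, upper_quantile])
    scale = q_high - q_low
    if not np.isfinite(scale) or scale == 0.0:
        scale = 1.0  # degenerate or constant column: only center it
    z = (x - median) / scale
    # smooth clipping: approximately identity near 0, saturating at +/- max_abs
    return z / np.sqrt(1.0 + (z / max_abs) ** 2)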

Some considerations:

  • Which name do we want to use for the class? Currently it is AdaptiveSquashingTransformer.
  • I did not check the rendering of the documentation. Do we need to link the class from somewhere in the documentation?
  • If the column passed in transform() has a different name than in fit_transform(), the output column still gets the name of the column passed in fit_transform(). I copied this behavior from StringEncoder, is that good? (See also the todo in the corresponding tests.)

dholzmueller avatar Apr 11 '25 16:04 dholzmueller

Also, I think having a smallish example that shows when and where to use your class would be great. WDYT?

Sounds good, where would you put it?

dholzmueller avatar Apr 14 '25 15:04 dholzmueller

I addressed everything except the example now.

dholzmueller avatar Apr 14 '25 15:04 dholzmueller

I removed the test that checked whether a column with name None still had name None after transformation, since it failed for polars. I assume this is not relevant? (The test is no longer necessary to reach 100% coverage.)

dholzmueller avatar Apr 14 '25 16:04 dholzmueller

Sounds good, where would you put it?

@dholzmueller, good question, I'm not sure. What would be the best way to illustrate its usefulness? Do you have a small benchmark or experiment that could fit into an example? Otherwise, we might need to be creative.

Vincent-Maladiere avatar Apr 15 '25 12:04 Vincent-Maladiere

We could try it on a dataset with an MLP and see if it improves anything (compared to StandardScaler? or QuantileTransformer?). I don't know which dataset would be best for this.

dholzmueller avatar Apr 15 '25 15:04 dholzmueller

In what setup (dataset + model) have you observed a sharp improvement in your experiments? Otherwise, we could try any skrub dataset with an MLP, as you suggest.

Vincent-Maladiere avatar Apr 15 '25 16:04 Vincent-Maladiere

Updates:

  • renamed the class to SquashingScaler
  • made it work for dataframes and not only series by using OnEachColumn with the old class
  • renamed the args to lower_quantile and upper_quantile
  • incorporated minor suggestions
  • added an example using the SquashingScaler
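
A rough sketch of the kind of comparison discussed above (the dataset, model, and settings below are placeholders picked for illustration, not the example added in this PR):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

from skrub import SquashingScaler  # the class added in this PR

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Same MLP, different numeric preprocessing.
for scaler in [StandardScaler(),
               QuantileTransformer(output_distribution="normal"),
               SquashingScaler()]:
    pipe = make_pipeline(scaler, MLPRegressor(max_iter=500, random_state=0))
    scores = cross_val_score(pipe, X, y, cv=3, scoring="r2")
    print(f"{type(scaler).__name__}: mean R^2 = {scores.mean():.3f}")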

dholzmueller avatar May 06 '25 14:05 dholzmueller

@GaelVaroquaux was suggesting to implement SquashingScaler directly instead of with the indirection through SingleColumnSquashingScaler. I'll need to check how easy that is...

dholzmueller avatar May 07 '25 14:05 dholzmueller

@GaelVaroquaux was suggesting to implement SquashingScaler directly instead of with the indirection through SingleColumnSquashingScaler. I'll need to check how easy that is...

that's an option too, I'll let you decide!

The downside is that we might end up duplicating some of OnEachColumn's logic to loop over columns, keep track of each column's transformation, and reconstruct a dataframe with the right column names regardless of the scikit-learn set_output setting, so it seems a bit easier to rely on OnEachColumn for that. However, here the transformer is fast enough that we don't need parallelization, and we don't have to worry about duplicate column names because we have a 1-to-1 mapping and apply it to all columns, so maybe it is simple enough that we can do without OnEachColumn.

In any case, I think we want to avoid allocating one big numpy array into which we copy all the dataframe's columns (as is done in scikit-learn with check_array).

jeromedockes avatar May 07 '25 15:05 jeromedockes

that's an option too, I'll let you decide!

Users (and non-advanced programmers) struggle with indirections.

Your point, Jerome, is about memory overhead? Do you think that 1) it will really be a problem, and 2) it will end up being avoided anyway, because the output of this transformer is likely going to be fed to a scikit-learn learner?

GaelVaroquaux avatar May 07 '25 15:05 GaelVaroquaux

Your point, Jerome, is about memory overhead?

Yes, because we go from columnar format to one contiguous array, then back to columnar (dataframe) format. That is the case even if the output dataframe is then passed to an estimator.

  1. it will really be a problem

Not necessarily, but when it is easy, it is always nice to avoid an extra copy, especially when it requires a much larger contiguous buffer than any single column in the input dataframe does.

The main reason I mention it is that the fitting of each column, as currently implemented, uses Python control flow anyway (to handle corner cases such as an all-empty column, equal quantiles, or a constant column), so I suspect it will not be easy to express in a "vectorized" way on one numpy array. So if we end up with a loop over columns anyway, let's avoid building the full array.

If it makes things more complicated, you can ignore my comment.
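
For what it's worth, a sketch of such a per-column loop, with hypothetical names and simplified corner-case handling (this is not the PR's actual code):

import numpy as np
import pandas as pd

def fit_per_column(df, lower_quantile=0.25, upper_quantile=0.75):
    # Each column is pulled out as its own numpy array, so we never allocate
    # one big contiguous matrix for the whole dataframe (as check_array would).
    params = {}
    for name in df.columns:
        col = np.asarray(df[name], dtype=float)
        finite = col[np.isfinite(col)]
        if finite.size == 0:
            # all-NaN or empty column: leave values untouched
            params[name] = (0.0, 1.0)
            continue
        median = np.median(finite)
        q_low, q_high = np.quantile(finite, [lower_quantile, upper_quantile])
        scale = q_high - q_low
        if scale == 0.0:
            # equal quantiles / constant column: fall back to centering only
            scale = 1.0
        params[name] = (median, scale)
    return params

params = fit_per_column(pd.DataFrame({"a": [1.0, 2.0, 100.0], "b": [np.nan] * 3}))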

jeromedockes avatar May 07 '25 15:05 jeromedockes

Conclusion of the last skrub meeting: this estimator should not use OnEachColumn because it adds an extra layer of abstraction on the base object. Therefore, we have to implement the necessary logic in this transformer.

Vincent-Maladiere avatar Jun 02 '25 13:06 Vincent-Maladiere

Update: this is now ready for review. This transformer behaves like any scikit-learn preprocessor, taking a pandas or polars dataframe or numpy array as input, and yielding a numpy array. This can be used within the TableVectorizer or ApplyOnCols without additional logic.
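
For example, a minimal usage sketch based on the interface described above (the column names and values are made up):

import pandas as pd

from skrub import SquashingScaler

df = pd.DataFrame({
    "age": [25, 30, 35, 120],                      # contains an outlier
    "income": [30_000, 45_000, 60_000, 1_000_000],
})

out = SquashingScaler().fit_transform(df)  # numpy array, one column per input column
print(out.shape)  # (4, 2)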

Vincent-Maladiere avatar Jul 07 '25 14:07 Vincent-Maladiere

  • I think the old behavior was to produce the original datatype (dataframe or numpy array), but I don't know what we actually want.
  • Do we want to support applying the SquashingScaler to pd.Series? I think it could be convenient in some cases.

dholzmueller avatar Jul 11 '25 14:07 dholzmueller

I think it could be convenient in some cases.

Yes, totally. I was trying to minimize the differences between sklearn and skrub scalers APIs, but we definitely can.

Vincent-Maladiere avatar Jul 11 '25 14:07 Vincent-Maladiere

Hey @rcap107 thanks for reviewing. I iterated on the documentation; let me know what you think

Vincent-Maladiere avatar Jul 15 '25 10:07 Vincent-Maladiere

Hey @rcap107 thanks for reviewing. I iterated on the documentation; let me know what you think

It's much better, thanks a lot! I fixed a small typo, but aside from that I think we can merge this

rcap107 avatar Jul 15 '25 10:07 rcap107

Great! Waiting for other people to give their feedback

Vincent-Maladiere avatar Jul 15 '25 13:07 Vincent-Maladiere

Interface looks good to me! Did you test what happens if you pass in an all-nan column?

dholzmueller avatar Jul 22 '25 15:07 dholzmueller

I'm waiting for @Vincent-Maladiere's reply to @dholzmueller's question above, and will then merge.

GaelVaroquaux avatar Jul 22 '25 15:07 GaelVaroquaux

Interface looks good to me! Did you test what happens if you pass in an all-nan column?

Good question. We get a column full of nans, since they are left untouched. Is that what you would expect?

Reproducer:

import numpy as np
import pandas as pd

import skrub

# The all-NaN column is passed through unchanged, so the output is an all-NaN array.
skrub.SquashingScaler().fit_transform(
    pd.DataFrame(dict(a=[np.nan, np.nan]))
)

Vincent-Maladiere avatar Jul 23 '25 09:07 Vincent-Maladiere

Yes, that is what I would expect. Just wasn't sure if it's going to work with the nanmax/nanmin operations.

dholzmueller avatar Jul 23 '25 11:07 dholzmueller

Just a thought, having quantile_range as a tuple instead of two separate values might make it harder for people to tune it in some hyperparameter optimization. I think it's probably not an important hyperparameter to tune, though.
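
To make that concrete, with a tuple parameter each search candidate is a whole (lower, upper) pair rather than two independently tunable values. A sketch, assuming the quantile_range name discussed here and scikit-learn's standard search API (the values are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline

from skrub import SquashingScaler

pipe = Pipeline([("scale", SquashingScaler()), ("mlp", MLPRegressor(max_iter=300))])

# Each candidate is a full pair; the two bounds cannot be searched independently.
param_grid = {"scale__quantile_range": [(25.0, 75.0), (10.0, 90.0), (5.0, 95.0)]}
search = GridSearchCV(pipe, param_grid, cv=3)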

dholzmueller avatar Jul 23 '25 11:07 dholzmueller

Just a thought, having quantile_range as a tuple instead of two separate values might make it harder for people to tune it in some hyperparameter optimization. I think it's probably not an important hyperparameter to tune, though.

This can be addressed in another PR

rcap107 avatar Jul 23 '25 11:07 rcap107

This can be addressed in another PR

You mean, another one for the same release? Because it's not as convenient to change the parameters after release, right?

dholzmueller avatar Jul 23 '25 11:07 dholzmueller

This can be addressed in another PR

You mean, another one for the same release? Because it's not as convenient to change the parameters after release, right?

It's not a very big deal, and it should not be a big change, up to you @Vincent-Maladiere

(I'd rather have this PR in the release and open a new one later than risk not having it)

rcap107 avatar Jul 23 '25 11:07 rcap107

That's a good point, both sklearn's RobustScaler and MinMaxScaler have a tuple range as a hyperparameter, so I think this is fine.

Vincent-Maladiere avatar Jul 23 '25 12:07 Vincent-Maladiere

Okay, then let's leave it as it is.

dholzmueller avatar Jul 23 '25 12:07 dholzmueller

OK, I suggest that we merge; the NaN thing can be addressed later.

GaelVaroquaux avatar Jul 23 '25 12:07 GaelVaroquaux