skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Improve UX of `.skb.subsample` when X and y are separate

Open rcap107 opened this issue 6 months ago • 14 comments

I have the following piece of code

data = fetch_medical_charge()

X = skrub.X(data.X)
y = skrub.y(data.y)

X = X.skb.subsample(n=1000)

I did not set y.skb.subsample because I was expecting that it would also be subsampled (somehow), but that was not the case and computing the preview with a hgb failed because the size of X and y was different.

RuntimeError: Evaluation of 'apply()' failed.
You can see the full traceback above. The error message was:
ValueError: Found input variables with inconsistent numbers of samples: [1000, 163065]

This isn't a problem if X and y are taken from the same dataset that is then sampled, but when that's not the case there is a problem.

I know that it's possible to sample the first 1000 values, but given that there can also be random sampling we might want to add a seed to be able to align X and y when they start separate.

In any case we need to explain in the docs that each variable must be sampled, if it's not based on a variable that's been sampled earlier.

rcap107 avatar Jun 17 '25 15:06 rcap107

I know that it's possible to sample the first 1000 values, but given that there can also be random sampling we might want to add a seed to be able to align X and y when they start separate.

With #1427 we'll be able to set a random seed for subsampling

Vincent-Maladiere avatar Jun 18 '25 12:06 Vincent-Maladiere

I know that it's possible to sample the first 1000 values, but given that there can also be random sampling we might want to add a seed to be able to align X and y when they start separate.

With #1427 we'll be able to set a random seed for subsampling

Wait, the way subsample is implemented, it produced the same random sequence of subsamples for each call (check, but this is the way that I understand it).

Thus there is no need to control the random seed, the behavior should be the right one by default.

It seems that we should explain this better.

GaelVaroquaux avatar Jun 18 '25 12:06 GaelVaroquaux

I know that it's possible to sample the first 1000 values, but given that there can also be random sampling we might want to add a seed to be able to align X and y when they start separate. With #1427 we'll be able to set a random seed for subsampling Wait, the way subsample is implemented, it produced the same random sequence of subsamples for each call (check, but this is the way that I understand it).

Thus there is no need to control the random seed, the behavior should be the right one by default.

It seems that we should explain this better.

To be honest, all I did was looking at the docstring. Normally, any random sampling has some kind of way to set the seed, so that's why I was surprised to not see anything like that.

If that's the case, this is another issue relative to the documentation

rcap107 avatar Jun 18 '25 12:06 rcap107

To be honest, all I did was looking at the docstring. Normally, any random sampling has some kind of way to set the seed, so that's why I was surprised to not see anything like that.

Still, we should show (and potentially example) that you can subsample X and y. Maybe as an example in the example section of the docstring.

Your question was legitimate

GaelVaroquaux avatar Jun 18 '25 13:06 GaelVaroquaux

Just a precision: subsampling is currently using a random seed of 0, and https://github.com/skrub-data/skrub/pull/1427 allows to set a different one

Vincent-Maladiere avatar Jun 18 '25 14:06 Vincent-Maladiere

I'll work on the docstring

rcap107 avatar Jun 18 '25 14:06 rcap107

Talking with @Vincent-Maladiere, we thought of improving the error message by detecting when one of X and y is sampled, but the other isn't. This should be done in .skb.apply when an estimator that takes y is used.

(I posted this in the PR, but I should have put the message here)

rcap107 avatar Jun 19 '25 13:06 rcap107

To improve the documentation, I think there may be a misunderstanding that should be addressed early on in the introduction of subsample. why did you expect that taking a sample from X would affect y? If X and y were pandas or polars dataframes, X.sample() would not modify y -- nor X, actually -- and it is the same for the skrub objects. it also works the same for other operations such as indexing X or multiplying it by 3.14 . So adding a note about the fact that subsampling X doesn't affect y as you did in the PR is very useful, but it might also be worth digging deeper and figuring out why one might expect that X.skb.subsample() would modify y in the first place, and see if we can better introduce subsample to avoid that confusion

jeromedockes avatar Jun 23 '25 05:06 jeromedockes

detecting when one of X and y is sampled, but the other isn't

I'm not sure this would be easy to detect reliably, as subsampling can be used anywhere. perhaps a simpler option would be if the shapes don't match, and subsampling is used at all, mention it as a possible (but not certain) source of the error eg is it possible that you applied skb.subsample() to X and not y? . the drawback is it might mislead some users to look at problems related to subsampling even when the shape mismatch has nothing to do with it, but the implementation would be much simpler

jeromedockes avatar Jun 23 '25 06:06 jeromedockes

Could we pop up this specific error message only when one of the two non-matching dataframe has been susampled and not the other?

On Jun 23, 2025, 09:43, at 09:43, "Jérôme Dockès" @.***> wrote:

jeromedockes left a comment (skrub-data/skrub#1446)

detecting when one of X and y is sampled, but the other isn't

I'm not sure this would be easy to detect reliably, as subsampling can be used anywhere. perhaps a simpler option would be if the shapes don't match, and subsampling is used at all, mention it as a possible (but not certain) source of the error eg is it possible that you applied skb.subsample() to X and not y? . the drawback is it might mislead some users to look at problems related to subsampling even when the shape mismatch has nothing to do with it, but the implementation would be much simpler

-- Reply to this email directly or view it on GitHub: https://github.com/skrub-data/skrub/issues/1446#issuecomment-2995044445 You are receiving this because you commented.

Message ID: @.***>

GaelVaroquaux avatar Jun 23 '25 08:06 GaelVaroquaux

To improve the documentation, I think there may be a misunderstanding that should be addressed early on in the introduction of subsample. why did you expect that taking a sample from X would affect y? If X and y were pandas or polars dataframes, X.sample() would not modify y -- nor X, actually -- and it is the same for the skrub objects. it also works the same for other operations such as indexing X or multiplying it by 3.14 . So adding a note about the fact that subsampling X doesn't affect y as you did in the PR is very useful, but it might also be worth digging deeper and figuring out why one might expect that X.skb.subsample() would modify y in the first place, and see if we can better introduce subsample to avoid that confusion

I think that part of my confusion was caused by the examples where subsampling is done on the very first variable, which is then split into X and y: at that point X and y are subsampled and everything works fine.

My original understanding was that the subsample operation would apply a "mask" (conceptually) over the original data to select only a subset of it, and that the "mask" would be removed for the actual training operation. Given it was a "temporary" operation, I somehow expected the flag "is sampled" to be propagated to all previews that follow the node where it's set, for all variables.

I think we should make the fact that sampling is local clear both in the examples and the documentation.

rcap107 avatar Jun 23 '25 11:06 rcap107

detecting when one of X and y is sampled, but the other isn't

I'm not sure this would be easy to detect reliably, as subsampling can be used anywhere. perhaps a simpler option would be if the shapes don't match, and subsampling is used at all, mention it as a possible (but not certain) source of the error eg is it possible that you applied skb.subsample() to X and not y? . the drawback is it might mislead some users to look at problems related to subsampling even when the shape mismatch has nothing to do with it, but the implementation would be much simpler

I'm fine with this solution for the new PR

rcap107 avatar Jun 23 '25 11:06 rcap107

thanks for explaining. indeed that should be clarified. maybe something like "it is similar to pd.DataFrame.sample() except that it becomes a no-op when keep_subsampling=False" would help

jeromedockes avatar Jun 23 '25 20:06 jeromedockes

I'm not sure this would be easy to detect reliably, as subsampling can be used anywhere.

actually I'm not sure why I said that; something like #1465 should cover simple cases such as the one mentioned in the issue.

it might be worth doing something similar in cross_validate to catch an error such as subsampling X before the split (mark_as_x) and subsampling y after; I'll try to have a look

jeromedockes avatar Jun 23 '25 21:06 jeromedockes