skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Dispatch aggregate

Open TheooJ opened this issue 1 year ago • 2 comments

The goal of this PR is to dispatch aggregate, currently written in two files, by directly implementing it in _agg_joiner.py.

Following discussions with @Vincent-Maladiere and @jeromedockes, AggJoiner and AggTarget now require the operations parameter by default, and will try to apply all operations on all columns — as opposed to now, where columns are separated in categorical and numeric and only some operations are computed on each category.

I’m planning on doing follow ups to completely remove the _pandas.py, _polars.py, _namespace.py files, and on refactoring AggTarget with cross-fitting.

TheooJ avatar Oct 17 '24 15:10 TheooJ

Thanks a lot @Vincent-Maladiere !

To answer your points:

  1. I can remove _check_dataframes and use CheckInputDataFrame, I just need to handle the case where aux_table="X" to keep the option of using the placeholder. I agree that AggTarget's check_inputs could be simplified, I plan on cleaning it in another PR (it's also missing some tests).
  2. Sure ! Is it something to do here or in another PR ? Should we open an issue so that we add this method for other transformers like Joiner ?
  3. I'm not sure we want this. On the one hand, it might be useful for long-term compatibility when selectors are public, so that people can either use list of cols or selectors directly. On the other, while I think it makes sense to say "compute the mean on all numeric columns", you are not going to join dataframes on all numeric or all string columns with key.

TheooJ avatar Oct 22 '24 15:10 TheooJ

  1. and 2. can be done in this PR, it's not a huge change.

I agree that 3. use cases are rarer, but it looks weird to support selectors for the cols argument only. I might have keys that I can identify using regexes for instance, when the name of the column changes for whatever reason. Not a strong requirement, but it shouldn't be too costly to add.

Vincent-Maladiere avatar Oct 22 '24 16:10 Vincent-Maladiere

If think the PR is good for a second pass. I probably need to rename it and add some extra things in the changelog.

@Vincent-Maladiere I addressed points 1 & 2: simplifying AggTarget's _check_dataframes (it was more of a complete RFC of AggTarget) and adding a get_feature_names_out method. Point 3 I didn't implement because I'm not sure it's a good idea to indicate users "you should use selectors to choose keys", so I didn't implement it in this PR. Shouldn't be a blocker for the review though and it's a quite minor change IMO

TheooJ avatar Nov 05 '24 20:11 TheooJ

Sorry, there's some duplicate with Jerome's feedback, we reviewed it simultaneously

Vincent-Maladiere avatar Nov 06 '24 16:11 Vincent-Maladiere

After discussion with @GaelVaroquaux, it was decided we won't make selectors public here. Let's open a meta issue on their support and documentation first

Let's aim to remove _namespace.py, _polars.py, and _pandas in a follow-up PR!

I'm on it :)

TheooJ avatar Nov 13 '24 11:11 TheooJ

the codecov failure is spurious, codecov seems to be mixing the report with that of an older commit and is showing a different diff than the actual one

jeromedockes avatar Nov 14 '24 08:11 jeromedockes