Dispatch aggregate
The goal of this PR is to dispatch aggregate, currently written in two files, by directly implementing it in _agg_joiner.py.
Following discussions with @Vincent-Maladiere and @jeromedockes, AggJoiner and AggTarget now require the operations parameter by default, and will try to apply all operations on all columns — as opposed to now, where columns are separated in categorical and numeric and only some operations are computed on each category.
I’m planning on doing follow ups to completely remove the _pandas.py, _polars.py, _namespace.py files, and on refactoring AggTarget with cross-fitting.
Thanks a lot @Vincent-Maladiere !
To answer your points:
- I can remove
_check_dataframesand useCheckInputDataFrame, I just need to handle the case whereaux_table="X"to keep the option of using the placeholder. I agree thatAggTarget'scheck_inputscould be simplified, I plan on cleaning it in another PR (it's also missing some tests). - Sure ! Is it something to do here or in another PR ? Should we open an issue so that we add this method for other transformers like
Joiner? - I'm not sure we want this. On the one hand, it might be useful for long-term compatibility when selectors are public, so that people can either use list of cols or selectors directly. On the other, while I think it makes sense to say "compute the mean on all numeric columns", you are not going to join dataframes on all numeric or all string columns with
key.
- and 2. can be done in this PR, it's not a huge change.
I agree that 3. use cases are rarer, but it looks weird to support selectors for the cols argument only. I might have keys that I can identify using regexes for instance, when the name of the column changes for whatever reason. Not a strong requirement, but it shouldn't be too costly to add.
If think the PR is good for a second pass. I probably need to rename it and add some extra things in the changelog.
@Vincent-Maladiere I addressed points 1 & 2: simplifying AggTarget's _check_dataframes (it was more of a complete RFC of AggTarget) and adding a get_feature_names_out method. Point 3 I didn't implement because I'm not sure it's a good idea to indicate users "you should use selectors to choose keys", so I didn't implement it in this PR. Shouldn't be a blocker for the review though and it's a quite minor change IMO
Sorry, there's some duplicate with Jerome's feedback, we reviewed it simultaneously
After discussion with @GaelVaroquaux, it was decided we won't make selectors public here. Let's open a meta issue on their support and documentation first
Let's aim to remove
_namespace.py,_polars.py, and_pandasin a follow-up PR!
I'm on it :)
the codecov failure is spurious, codecov seems to be mixing the report with that of an older commit and is showing a different diff than the actual one