Introduce a "join" operator for simpler join execution
Problem Description
The current implementation of Joiner and MultiAggJoiner has some limitations which are in part caused by the fact they need to follow the scikit-learn estimator template:
- It implements only the left join. This makes sense as an estimator because the number of samples must remain constant. However, a user may expect to be able to perform any other kind of join (inner, outer, anti...) since that is the behavior of pandas or polars merge operators.
- It is hard to put in production because the join tables are defined in the init and may change between the init and when the join is executed.
( @Vincent-Maladiere ) - In general, the fit/transform structure makes it clunky to use if the user only needs it to perform multiple joins and does not care about putting it into a pipeline.
I think it would be useful to have a more lightweight "join operator" that implements the join without the constraints of the estimator.
Feature Description
Rather than the current implementation, a join_tables operator would look similar to this:
joined_table = skrub.join_tables(main_table,
aux_tables=[aux_table_1, aux_table_2, ...],
left_on=["key1", "key2"],
right_on=['"id1", "id2"], how="inner"
)
I am calling this an "operator" because it will operate directly on the given tables, and is stateless.
It should be possible to reuse most of the machinery that has already been implemented in the Joiners, so it should not be too complicated to implement.
Alternative Solutions
No response
Additional Context
No response
If this is up for grabs, then I'd love to take this up!
That's great @Neilblaze ! I think it would be best to wait for #1233 to be merged before we start working on this one, because #1233 will influence what we do with the joiners
@jeromedockes gotcha, thanks for the heads up! Will keep an eye on #1233 and will proceed once it's been merged 😃
Reviving this issue, I think we should have a fuzzy join function that:
- doesn't perform vectorization
- therefore, is stateless
- therefore, only accept numeric columns
The reason is that doing fuzzy-joining between different entities (year, job title) doesn't make sense from an L2 distance perspective, and the results are quite bad.
Then there's a discussion we should have regarding the choice of distance metrics. The L2 as default makes sense, but we might want to include others, like the Haversine distance.
Longer term, I wonder whether we should depreciate AggJoiner and AggTarget, since they can't be used in a production environment. They might bring values for some analysis, but I wonder if that's worth having them. WDYT?
I agree with the feeling and the general angle suggested here.
I'd like to work on docs, example, and usability of expressions before we move on this problem.