skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Introduce a "join" operator for simpler join execution

Open rcap107 opened this issue 10 months ago • 5 comments

Problem Description

The current implementation of Joiner and MultiAggJoiner has some limitations which are in part caused by the fact they need to follow the scikit-learn estimator template:

  • It implements only the left join. This makes sense as an estimator because the number of samples must remain constant. However, a user may expect to be able to perform any other kind of join (inner, outer, anti...) since that is the behavior of pandas or polars merge operators.
  • It is hard to put in production because the join tables are defined in the init and may change between the init and when the join is executed.
    ( @Vincent-Maladiere )
  • In general, the fit/transform structure makes it clunky to use if the user only needs it to perform multiple joins and does not care about putting it into a pipeline.

I think it would be useful to have a more lightweight "join operator" that implements the join without the constraints of the estimator.

Feature Description

Rather than the current implementation, a join_tables operator would look similar to this:

joined_table = skrub.join_tables(main_table, 
   aux_tables=[aux_table_1, aux_table_2, ...], 
   left_on=["key1", "key2"], 
   right_on=['"id1", "id2"], how="inner"
)

I am calling this an "operator" because it will operate directly on the given tables, and is stateless.

It should be possible to reuse most of the machinery that has already been implemented in the Joiners, so it should not be too complicated to implement.

Alternative Solutions

No response

Additional Context

No response

rcap107 avatar Feb 10 '25 11:02 rcap107

If this is up for grabs, then I'd love to take this up!

Neilblaze avatar Mar 31 '25 09:03 Neilblaze

That's great @Neilblaze ! I think it would be best to wait for #1233 to be merged before we start working on this one, because #1233 will influence what we do with the joiners

jeromedockes avatar Mar 31 '25 10:03 jeromedockes

@jeromedockes gotcha, thanks for the heads up! Will keep an eye on #1233 and will proceed once it's been merged 😃

Neilblaze avatar Mar 31 '25 15:03 Neilblaze

Reviving this issue, I think we should have a fuzzy join function that:

  • doesn't perform vectorization
  • therefore, is stateless
  • therefore, only accept numeric columns

The reason is that doing fuzzy-joining between different entities (year, job title) doesn't make sense from an L2 distance perspective, and the results are quite bad.

Then there's a discussion we should have regarding the choice of distance metrics. The L2 as default makes sense, but we might want to include others, like the Haversine distance.

Longer term, I wonder whether we should depreciate AggJoiner and AggTarget, since they can't be used in a production environment. They might bring values for some analysis, but I wonder if that's worth having them. WDYT?

Vincent-Maladiere avatar Jun 30 '25 11:06 Vincent-Maladiere

I agree with the feeling and the general angle suggested here.

I'd like to work on docs, example, and usability of expressions before we move on this problem.

GaelVaroquaux avatar Jul 01 '25 09:07 GaelVaroquaux