polars icon indicating copy to clipboard operation
polars copied to clipboard

Merge two sorted dataframes into new sorted dataframe

Open marsupialtail opened this issue 1 year ago • 4 comments

Problem description

Imagine there are two dataframes both already sorted by key column, i.e. a and b sorted by column k.

I would like to speedup the following polars.concat([a,b]).sort(k).

I believe a lot of code from the sort-merge join recently added to Polars can be directly used here.

This is kinda like a sorted outer-join, but not exactly.

marsupialtail avatar Oct 24 '22 20:10 marsupialtail

@ritchie46 I will upgrade my sponsor level if you implement this :1st_place_medal:

marsupialtail avatar Nov 03 '22 17:11 marsupialtail

This will be added by #5817

ritchie46 avatar Dec 20 '22 21:12 ritchie46

Seems like I have to upgrade my sponsor level then

marsupialtail avatar Dec 20 '22 21:12 marsupialtail

Wait where is the Python API?

marsupialtail avatar Dec 20 '22 21:12 marsupialtail

Wait where is the Python API?

Still working on it ;)

ritchie46 avatar Dec 23 '22 12:12 ritchie46

I have updated my sponsorship level as promised.

marsupialtail avatar Dec 31 '22 19:12 marsupialtail

Hi @ritchie46, thanks for the implementation -- super helpful feature!

I have a quick question if you don't mind. I'm wondering if you have any insight on how this function performs if the goal is to do the analogous problem with n sorted DataFrames instead of just the two.

Is the most efficient way to do this just df1.merge_sorted(df2).merge_sorted(df3).merge_sorted(df4) and so on, or perhaps some sort of divide and conquer approach, like

x = df1.merge_sorted(df2);
y = df3.merge_sorted(df4);
x.merge_sorted(y)

to minimize the maximum DataFrame size?

edit: I suppose the ideal case would to have merge_sorted actually accept a list of inputs, similar to concat, but not sure how much of a direct lift that is

george-qi avatar Apr 16 '23 17:04 george-qi