polars
Merge two sorted dataframes into new sorted dataframe
Problem description
Imagine there are two dataframes, both already sorted by a key column, i.e. a and b are each sorted by column k.
I would like to speed up the following: polars.concat([a, b]).sort(k).
I believe a lot of code from the sort-merge join recently added to Polars can be directly used here.
This is kinda like a sorted outer-join, but not exactly.
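To make the request concrete, here is a minimal sketch of the classic two-pointer merge on plain Python rows. It is not Polars code and not how Polars implements anything; the function name merge_sorted_rows and the list-of-dicts row layout are made up for illustration. The point is only that two already-sorted inputs can be combined in a single linear pass, while concat followed by sort re-sorts everything from scratch.

```python
def merge_sorted_rows(a, b, key):
    """Merge two lists of dict rows, each already sorted by `key`.

    Hypothetical helper for illustration only: one linear pass over
    both inputs, no re-sorting.
    """
    out = []
    i = j = 0
    # Repeatedly take the smaller head row; on ties, take from `a` first.
    while i < len(a) and j < len(b):
        if a[i][key] <= b[j][key]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of the inputs still has rows left; append them as-is.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


a = [{"k": 1, "v": "a1"}, {"k": 4, "v": "a2"}, {"k": 7, "v": "a3"}]
b = [{"k": 2, "v": "b1"}, {"k": 4, "v": "b2"}, {"k": 9, "v": "b3"}]
merged = merge_sorted_rows(a, b, "k")
```

Unlike an outer join, rows with equal keys are not matched up; they are simply placed next to each other in the output.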
@ritchie46 I will upgrade my sponsor level if you implement this :1st_place_medal:
This will be added by #5817
Seems like I have to upgrade my sponsor level then
Wait where is the Python API?
Still working on it ;)
I have updated my sponsorship level as promised.
Hi @ritchie46, thanks for the implementation -- super helpful feature!
I have a quick question if you don't mind. I'm wondering if you have any insight into how this function performs when the goal is to solve the analogous problem with n sorted DataFrames instead of just two.
Is the most efficient way to do this just df1.merge_sorted(df2).merge_sorted(df3).merge_sorted(df4) and so on, or perhaps some sort of divide and conquer approach, like
x = df1.merge_sorted(df2)
y = df3.merge_sorted(df4)
x.merge_sorted(y)
to minimize the maximum DataFrame size?
edit: I suppose the ideal case would be to have merge_sorted actually accept a list of inputs, similar to concat, but I'm not sure how much of a direct lift that is.