
Feature request: asofjoin

Open natemcintosh opened this issue 4 years ago • 20 comments

This is a request for an asof join type, as seen in pandas, kdb+, QuestDB, and probably a few others.

This is a rather niche join, but extremely useful for time series work. I think the pandas description sums it up pretty well:

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key. For each row in the left DataFrame:

  • A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
  • A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
  • A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

The default is “backward” and is compatible in versions below 0.20.0. The direction parameter was added in version 0.20.0 and introduces “forward” and “nearest”. Optionally match on equivalent keys with ‘by’ before searching with ‘on’.

natemcintosh avatar Apr 27 '21 18:04 natemcintosh

OK - request noted. Let us wait for others to comment. Currently most development is focused on other areas (performance), but we can get back to it in the future.

bkamins avatar Apr 27 '21 19:04 bkamins

Is it related to this? https://discourse.julialang.org/t/merge-dataframes-where-one-value-is-between-two-others/34738

skanskan avatar Apr 28 '21 10:04 skanskan

It is very similar, but I believe there is one key difference: the "right" dataframe only has one timestamp, not a start timestamp and an end timestamp.

A super simple example (without a groupby option) is:

using DataFrames

left = DataFrame(time=[1,5,10], left_val=["a","b","c"])
right = DataFrame(time=[1,2,3,6,7], right_val=[1,2,3,6,7])

# for each time in `left`, find the time (if any) in `right` that 
# is equal to or before it
right_inds = [searchsortedlast(right.time, t) for t in left.time]
# returns right_inds = [1,3,5]

# concatenate the dataframes
df = hcat(left, right[right_inds, :], makeunique=true)

The pandas version of this function also does forward and nearest selection, but backward selection is probably used most often.
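For reference, the other two directions can be sketched with the same toy data (a rough sketch assuming both time columns are sorted; breaking ties in favor of the earlier row is just an arbitrary choice here, not necessarily pandas' exact rule):

# "forward": first time in `right` that is greater than or equal to the left time
forward_inds = [searchsortedfirst(right.time, t) for t in left.time]
# returns [1, 4, 6]; 6 is past the end of `right`, i.e. no match for time 10

# "nearest": take whichever of the backward/forward candidates is closer
nearest_inds = map(left.time) do t
    lo = searchsortedlast(right.time, t)   # 0 when every right time is later
    hi = searchsortedfirst(right.time, t)  # length + 1 when every right time is earlier
    lo == 0 && return hi
    hi > length(right.time) && return lo
    abs(right.time[lo] - t) <= abs(right.time[hi] - t) ? lo : hi
end
# returns [1, 4, 5]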

natemcintosh avatar Apr 28 '21 12:04 natemcintosh

I agree this function would be very useful. It is definitely niche, mostly relevant to financial data, though I am sure some other fields use it. After asking that question and writing an initial function, I modified it to include a minimize component (which works for asof if the ranges are correct): https://gist.github.com/junder873/ae1ec240f2fd6d4cbdb2ec19bc2b5f9c

I am sure my code is sloppy, so I would love a more formal solution to these types of problems.

junder873 avatar May 25 '21 16:05 junder873

For reference, here are the kdb docs for asof and window joins: https://code.kx.com/q/ref/aj/ https://code.kx.com/q/ref/wj/

kdb is very well designed and optimized for filtering, joining, and basic data operations on financial market data. It's definitely worth a look for inspiration for any dataframes library. I still strongly prefer its syntax vs pandas or Julia's dataframes.

robsmith11 avatar Jun 05 '21 08:06 robsmith11

I agree this function would be very useful. It is definitely niche, mostly relevant to financial data, though I am sure some other fields use it. After asking that question and writing an initial function, I modified it to include a minimize component (which works for asof if the ranges are correct): https://gist.github.com/junder873/ae1ec240f2fd6d4cbdb2ec19bc2b5f9c

I am sure my code is sloppy, so I would love a more formal solution to these types of problems.

I understand how range_join performs the join by matching times that fall between two other times, but how does it work for an asof join, where you only want a row before or equal to a certain time?

natemcintosh avatar Jul 27 '21 13:07 natemcintosh

It would be using two different components of the function I posted, the conditions component and the minimize component:

using DataFrames, Dates

df1 = DataFrame(
    id=1:2,
    date_left=[Date(2020, 1, 5), Date(2020, 5, 3)]
)
df2 = DataFrame(
    id=[1, 1, 1, 2, 2],
    date_right=[Date(2020, 1, 1), Date(2020, 1, 4), Date(2020, 1, 6), Date(2019, 1, 1), Date(2020, 5, 1)],
    val=1:5
)

# range_join as defined in the gist linked above
range_join(
    df1,
    df2,
    [:id],
    Tuple{Function, Symbol, Symbol}[
        (>, :date_left, :date_right)
    ],
    minimize=[:date_left => :date_right]
)

junder873 avatar Aug 02 '21 15:08 junder873

So I would propose the following (we would need to think about the best name though; technically we could add this extension to leftjoin since it would not be breaking, but it might be better to add a new name because the functionality is more complex, even though leftjoin is a special case of it; potentially we could add this API to all joins, but I am not sure it is worth it - let us discuss):

rulejoin(df1, df2; on, condition=(x...) -> all(t -> isequal(t[1], t[2]), x), makeunique=false, source=nothing, validate=(false, false), renamecols=(identity => identity), window=nothing, keep=:all)

This would behave like leftjoin except that new condition, window and keep kwargs are added.

  • condition: a function taking as many positional arguments as there are on columns; each positional argument is a tuple of values from the left and right table. By default it checks whether the values from the left and right table are equal (so the default is the current leftjoin behavior); we could add predefined functions for some standard conditions (like maximum absolute distance, greater than, etc.)
  • keep: which matching rows from the right table to keep. By default :all to keep all rows. Other options are :first (keep the first matching row), :last (keep the last matching row), or a function; in that case it takes as many positional arguments as there are on columns, each positional argument is a tuple of values from the left and right table, and it should return values that are comparable under a total order; the row kept is then the first row that is a minimal element of the set of matching rows;
  • window:
    • if nothing then the operation is O(n*m) - just all combinations of left and right table are checked; the result is produced in the order of rows from df1
    • if :ascending then df1 and df2 are sorted before the operation and the user guarantees that condition is a "window function" (i.e. after the data frames are sorted, the matching rows form a contiguous block whose begin and end both increase as you move along); in this case, after sorting, the cost of matching is O(n+m), and sorting is done only if needed (in particular, it is skipped if the data frames are already sorted)
    • if :descending then the behavior is the same as for :ascending, but with rev=true when sorting

Adding this feature would cover all kinds of joins on an arbitrary condition, with "as of" and similar joins being a fast special case.

What I am not sure about is whether this is overly general (and thus not needed in practice). Maybe it is enough to support such matching only on a single column and require the other on columns to be standard equijoin conditions?
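To make the proposed semantics a bit more concrete, here is a rough plain-Julia sketch of what the naive window=nothing path with keep=:last could do for a single on column (rulejoin and its keyword arguments are only a proposal, so the function and names below are purely illustrative):

using DataFrames

# naive O(n*m) matching: for every left row, find all right rows whose key
# satisfies `condition` (called with a (left_value, right_value) tuple) and
# keep only the last match, i.e. keep=:last
function naive_keep_last(df1, df2; on, condition)
    pieces = DataFrame[]
    for row in eachrow(df1)
        idx = findall(r -> condition((row[on], r)), df2[!, on])
        isempty(idx) && continue   # a leftjoin-style result would instead keep the row with missings
        push!(pieces, hcat(DataFrame(row), df2[[last(idx)], :], makeunique=true))
    end
    return reduce(vcat, pieces)
end

left  = DataFrame(time=[1, 5, 10], left_val=["a", "b", "c"])
right = DataFrame(time=[1, 2, 3, 6, 7], right_val=[1, 2, 3, 6, 7])

# condition "right key <= left key" combined with keep=:last is the backward asof join
naive_keep_last(left, right; on=:time, condition=((l, r),) -> r <= l)

The window=:ascending fast path would replace the findall scan over df2 with a sorted sweep, like the searchsortedlast example earlier in this thread.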

CC @nalimilan

bkamins avatar Dec 11 '22 11:12 bkamins

@jariji - this is another important PR where API design is a key challenge.

bkamins avatar Dec 27 '22 21:12 bkamins

I'll just mention a related thing for now. Sometimes I want fuzzy joins like on=:birthdate, nearest=:name => levenshtein or on=:birthdate, filter=:name => <(4)∘levenshtein (reduces memory usage relative to two-step process).

jariji avatar Dec 27 '22 21:12 jariji

I was thinking of a "nearest" join, but could not decide what to do with it (the current proposal does not cover a nearest join). If it is indeed needed, we could change condition from returning a Bool (is the condition met) to returning a distance, and then add a threshold option (or something like this), where only pairs whose distance is less than threshold are considered matching; one of the options for threshold could be e.g. min (to select only the nearest match).
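A tiny sketch of that distance/threshold idea on plain vectors (everything below is made up for illustration; no such option exists in DataFrames.jl today):

# for each left key: compute the distance to every right key, then keep the
# nearest match only if its distance is within `threshold` (missing otherwise)
function nearest_within(left_keys, right_keys; distance=(l, r) -> abs(l - r), threshold=Inf)
    map(left_keys) do l
        d = [distance(l, r) for r in right_keys]
        i = argmin(d)
        d[i] <= threshold ? i : missing
    end
end

nearest_within([1, 5, 10], [1, 2, 3, 6, 7])               # [1, 4, 5]
nearest_within([1, 5, 10], [1, 2, 3, 6, 7]; threshold=2)  # [1, 4, missing]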

bkamins avatar Dec 27 '22 22:12 bkamins

Hi all, I’ll share my two cents as a recent convert from R tidyverse to DataFrames.jl.

My two favorite features from the new join_by feature in R tidyverse are inequality joins, and rolling joins (see docs here: https://dplyr.tidyverse.org/reference/join_by.html).

The main difference between these is that inequality joins select all rows meeting a set of criteria, whereas rolling joins select the closest row. In the tidyverse implementation, you can choose to use the closest matches for all or only some of the criteria.

So in R the following code will select all rows where the id columns match and the sale_date in the sales data frame is >= promo_date in the promos data frame.

left_join(sales, promos, join_by(id == id, sale_date >= promo_date))

For rolling joins, this will select the closest match only. Note that you can separate multiple criteria with commas, and you can apply closest to only certain criteria.

left_join(sales, promos, join_by(id == id, closest(sale_date >= promo_date)))

What I would love to see in DataFrames.jl for inequality joins would be something like this:

leftjoin(sales, promos, ((df1, df2) -> (df1.id .== df2.id) .& (df1.sale_date .>= df2.promo_date)))

For rolling joins, it would be nice to do this using some helpful function, but even a keyword argument could work, like this:

leftjoin(sales, promos, ((df1, df2) -> (df1.id .== df2.id) .& (df1.sale_date .>= df2.promo_date)); match = :closest)

I totally get that this design probably violates the function barrier, which could lead to performance problems, so another way to write this that respects the function barrier might be something like this (where the initial tuples inside the vector map the respective columns to symbols from the respective data frames):

leftjoin(sales, promos, [(:id => :id1, :sale_date), (:id => :id2, :promo_date)] =>
                                      ((id1, sale_date, id2, promo_date) -> (id1 .== id2) .& (sale_date .>= promo_date)))

Also, you potentially wouldn’t even need the match = :closest keyword argument because you could build that into the function itself, like this:

leftjoin(sales, promos, [(:id => :id1, :sale_date), (:id => :id2, :promo_date)] =>
                                      ((id1, sale_date, id2, promo_date) -> (id1 .== id2) .& (sale_date .>= promo_date) .&
                                      (sale_date .- promo_date .== minimum(sale_date .- promo_date))))

I haven’t looked through all of the alternative proposals above, so mea culpa if there is overlap. What I like about this is that you could specify all kinds of things within the function, so it would be quite flexible. It is verbose, for sure. Also, I’m new at this, so feel free to take that into account when weighing whether this is good or bad design.

kdpsingh avatar Mar 22 '23 19:03 kdpsingh

I've used this package for various time series joins: https://gitlab.com/aplavin/FlexiJoins.jl

The syntax takes a little getting used to, but it's pretty powerful and performance is decent, whether joining DataFrames or other types.

robsmith11 avatar Mar 23 '23 00:03 robsmith11

I've had issues with getting the multi keyword argument to work with DataFrames using FlexiJoins.jl. Also, I don't like the behavior that results in duplicated column names for the keys being joined on. But yes, I do think the syntax is totally reasonable if those issues were cleaned up or clarified.

kdpsingh avatar Mar 23 '23 00:03 kdpsingh

@kdpsingh Here's how I use the multi keyword to do an asof join on DataFrames:

julia> using DataFrames, FlexiJoins

julia> t1 = DataFrame(t=rand(10), a=rand(10)); t2 = DataFrame(t=rand(100), b=rand(100));

julia> leftjoin((t1, t2), by_pred(:t, >=, :t), multi=(identity,closest))
10×4 DataFrame
 Row │ t          a         t_1              b
     │ Float64    Float64   Float64?         Float64?
─────┼───────────────────────────────────────────────────────
   1 │ 0.026362   0.513261        0.0181962        0.724834
   2 │ 0.749699   0.149604        0.725635         0.968237
   3 │ 0.195397   0.260313        0.189569         0.820139
   4 │ 0.491868   0.891044        0.470237         0.841432
   5 │ 0.901302   0.168646        0.872977         0.227124
   6 │ 0.35446    0.740321        0.351904         0.385458
   7 │ 0.370695   0.555825        0.354942         0.196156
   8 │ 0.914361   0.991622        0.907086         0.792908
   9 │ 0.287491   0.538382        0.287004         0.0516505
  10 │ 0.0106906  0.180112  missing          missing

robsmith11 avatar Mar 23 '23 16:03 robsmith11

Thanks! That's super helpful. Quick question: what is the purpose of identity in the multi argument?

I think in that case my only critique of the FlexiJoins.jl implementation is that when matching on name (or equality by pred), there's no need to create a and a_1. For inequality joins, it makes sense to have both (since they won't always match), but otherwise it creates an extra column that's not needed, especially when they have the same name to begin with.

Otherwise, I would be happy if this were incorporated as core functionality within DataFrames.jl. It's certainly less verbose than what I proposed.

kdpsingh avatar Mar 23 '23 17:03 kdpsingh

what is the purpose of identity in the multi argument?

My understanding is that it's just a placeholder, so that it's clear to which table the closest applies. For other data types, it's possible to pass multi a named tuple and then identity isn't needed, but for some reason that I forget, that won't work with DataFrames.

robsmith11 avatar Mar 24 '23 05:03 robsmith11

FlexiJoins is a general join implementation, working for various collection types - most commonly, some kind of array. Many collections support the common Julia interface (e.g. indexing, views), so they work as-is with FlexiJoins without any specific processing.

DataFrames don't follow the same interface, and they are handled by a thin conversion layer: https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L40-57. This layer is considered experimental, with details that may change. I personally don't need dataframes and use simpler table types, so this conversion layer is very basic for now. Please suggest concrete improvements if you have a better understanding of what DataFrames users would expect from joining.

Also, I don't like the behavior that results in duplicated column names for the keys being joined on.

For regular Julia collections as the input, the result is nested - see examples at https://aplavin.github.io/FlexiJoins.jl/test/examples.html. Dataframes are fundamentally flat, so this results in duplicates if no postprocessing is done (as is currently the case).

For other data types, it's possible to pass multi a named tuple and then identity isn't needed, but for some reason that I forget, that won't work with DataFrames.

It just wasn't clear what the semantics of passing two dataframes as a namedtuple should be - so I didn't implement this. Feel free to suggest any changes to the conversion layer, I don't have strong opinions on how dataframes should behave.

what is the purpose of identity in the multi argument?

multi indicates what to do with duplicate matches. As in regular Julia code, identity means "don't do anything, keep all", last - "take the last match", closest (special value) - take the closest when it makes sense.
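In plain-Julia terms (just a conceptual sketch, not FlexiJoins internals), given the right-side matches already found for one left row, the three settings only differ in which of those matches survive:

left_time   = 5.0
match_times = [1.0, 2.0, 3.0]   # right rows already known to satisfy the >= predicate
match_inds  = [10, 11, 12]      # their row indices in the right table

keep_all     = match_inds                                             # multi = identity
keep_last    = [last(match_inds)]                                     # multi = last
keep_closest = [match_inds[argmin(abs.(match_times .- left_time))]]   # multi = closest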

aplavin avatar Apr 03 '23 14:04 aplavin

Thanks @aplavin! Appreciate the answers. None of what I said was intended as criticism - it's more stuff that DataFrames.jl maintainers should consider as they think about the best way to support flexible joins using DataFrames. I totally recognize that your package works with a much broader set of data structures.

kdpsingh avatar Apr 04 '23 00:04 kdpsingh

@kdpsingh I just wanted to explain why the situation with dataframes in FlexiJoins is the way it is, and to suggest a direct way to improve it right now for anyone with a better understanding of the expectations. Tone can be hard to read in written communication - I didn't take your questions as criticism anyway :)

aplavin avatar Apr 04 '23 08:04 aplavin