DataFrames.jl
Feature request: asofjoin
This is a request for an asof join type, as seen in pandas, kdb+, QuestDB, and probably a few others.
This is a rather niche join, but extremely useful for time series work. I think pandas' description sums it up pretty well:
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key. For each row in the left DataFrame:
- A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
- A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
- A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
The default is “backward” and is compatible in versions below 0.20.0. The direction parameter was added in version 0.20.0 and introduces “forward” and “nearest”. Optionally match on equivalent keys with ‘by’ before searching with ‘on’.
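For intuition, the three search directions described above can be sketched with Base Julia's sorted-search functions (the helper name asof_index is illustrative, not an existing API):

```julia
# Illustrative helper: given a sorted vector of right-table keys, return the
# index matched for a left key `t` under each direction (nothing if no match).
function asof_index(right_keys, t; direction::Symbol=:backward)
    n = length(right_keys)
    if direction == :backward
        i = searchsortedlast(right_keys, t)   # last key <= t
        return i == 0 ? nothing : i
    elseif direction == :forward
        i = searchsortedfirst(right_keys, t)  # first key >= t
        return i > n ? nothing : i
    else  # :nearest - pick whichever neighbor is closer in absolute distance
        b = searchsortedlast(right_keys, t)
        f = searchsortedfirst(right_keys, t)
        b == 0 && return f > n ? nothing : f
        f > n && return b
        return t - right_keys[b] <= right_keys[f] - t ? b : f
    end
end

# e.g. with sorted right keys [1, 2, 3, 6, 7]:
# asof_index([1,2,3,6,7], 5)                       # backward -> 3 (key 3)
# asof_index([1,2,3,6,7], 5, direction=:forward)   # -> 4 (key 6)
# asof_index([1,2,3,6,7], 5, direction=:nearest)   # -> 4 (|6-5| < |5-3|)
```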
OK - request noted. Let us wait for others to comment. Currently most development effort is focused on other areas (performance), but we can get back to this in the future.
Is it related to this? https://discourse.julialang.org/t/merge-dataframes-where-one-value-is-between-two-others/34738
It is very similar, but I believe there is one key difference: the "right" dataframe only has one timestamp, not a start timestamp and an end timestamp.
A super simple example (without a groupby option) is
using DataFrames

left = DataFrame(time=[1,5,10], left_val=["a","b","c"])
right = DataFrame(time=[1,2,3,6,7], right_val=[1,2,3,6,7])
# for each time in `left`, find the time (if any) in `right` that
# is equal to or before it
right_inds = [searchsortedlast(right.time, t) for t in left.time]
# returns right_inds = [1,3,5]
# concatenate the dataframes
df = hcat(left, right[right_inds, :], makeunique=true)
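One caveat with this sketch: searchsortedlast returns 0 when the left key precedes every right key, and right[0, :] would error. A variant that pads such rows with missing (same frames as above, plus an extra early row to trigger the case):

```julia
using DataFrames

left = DataFrame(time=[0, 1, 5, 10], left_val=["z", "a", "b", "c"])
right = DataFrame(time=[1, 2, 3, 6, 7], right_val=[1, 2, 3, 6, 7])

# 0 here means "no right row at or before this time"
right_inds = [searchsortedlast(right.time, t) for t in left.time]  # [0, 1, 3, 5]

# take a placeholder row for unmatched entries, then blank it out with missing
matched = allowmissing(right[max.(right_inds, 1), :])
for (i, ind) in enumerate(right_inds)
    if ind == 0
        for col in names(matched)
            matched[i, col] = missing
        end
    end
end
df = hcat(left, matched, makeunique=true)
```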
The pandas version of this function also supports forward and nearest selection, but backward selection is probably the most common.
I agree this function would be very useful. It is admittedly niche, mostly useful for financial data, though I am sure some other fields use it as well. After asking that question and writing an initial function, I modified it to include a minimize component (which works for asof if the ranges are correct): https://gist.github.com/junder873/ae1ec240f2fd6d4cbdb2ec19bc2b5f9c
I am sure my code is sloppy, so I would love a more formal solution to these types of problems.
For reference, here are the kdb docs for asof and window joins: https://code.kx.com/q/ref/aj/ https://code.kx.com/q/ref/wj/
kdb is very well designed and optimized for filtering, joining, and basic data operations on financial market data. It's definitely worth a look for inspiration for any dataframes library. I still strongly prefer its syntax to pandas or Julia's DataFrames.
I understand how range_join performs the join by requiring a value to fall between two other times, but how does it work for an asof join, where you only want the row before or equal to a certain time?
It would be using two different components of the function I posted, the conditions component and the minimize component:
df1 = DataFrame(
id=1:2,
date_left=[Date(2020, 1, 5), Date(2020, 5, 3)]
)
df2 = DataFrame(
id=[1, 1, 1, 2, 2],
date_right=[Date(2020, 1, 1), Date(2020, 1, 4), Date(2020, 1, 6), Date(2019, 1, 1), Date(2020, 5, 1)],
val=1:5
)
range_join(
df1,
df2,
[:id],
Tuple{Function, Symbol, Symbol}[
(>, :date_left, :date_right)
],
minimize=[:date_left => :date_right]
)
So I would propose the following. (We would need to think about the best name, though; technically we could add this extension to leftjoin, as it is not breaking, but I was not sure whether it would be better to add a new name since the functionality is more complex, although leftjoin is a special case of it. Potentially we could add this API to all joins, but I am not sure it is worth it - let us discuss.)
rulejoin(df1, df2; on, condition=(x...) -> all(t -> isequal(t[1], t[2]), x), makeunique=false, source=nothing, validate=(false, false), renamecols=(identity => identity), window=nothing, keep=:all)
This would behave like leftjoin except that new condition, window and keep kwargs are added.
condition: a function taking as many positional arguments as there are on columns; each positional argument is a tuple of values from the left and right table. By default it checks whether the values from the left and right table are equal (so the default is plain leftjoin). We could add predefined functions for some standard conditions (like maximum absolute distance, greater than, etc.).
keep: which matching rows from the right table to keep. By default :all, to keep all rows. Other options are :first (keep the first matching row), :last (keep the last matching row), or a function, in which case it takes as many positional arguments as there are on columns, each positional argument being a tuple of values from the left and right table, and it should return values that are comparable and have a total order; the result is the first row that is a minimal element of the set of matching rows.
window:
- if nothing, then the operation is O(n*m) - all combinations of rows from the left and right table are checked; the result is produced in the order of rows from df1;
- if :ascending, then df1 and df2 are sorted before the operation and the user guarantees that condition is a "window function" (i.e. as you move along after the data frames are sorted, the matching rows form a contiguous block, and both the beginning and the end of the block are increasing); in this case, after sorting, the cost of matching is O(n+m). Sorting is done only if needed;
- if :descending, then the behavior is the same as for :ascending, but with rev=true when sorting.
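For intuition, the :ascending fast path for a backward as-of condition can be sketched as a two-pointer merge over the sorted key vectors (illustrative only, not the proposed implementation):

```julia
# For sorted key vectors, return for each left key the index of the last
# right key <= it (0 if none), in one O(n + m) pass instead of O(n * m).
function asof_backward_indices(left_keys, right_keys)
    inds = zeros(Int, length(left_keys))
    j = 0
    for (i, t) in enumerate(left_keys)
        # the matching block's end only ever moves forward as t increases
        while j < length(right_keys) && right_keys[j + 1] <= t
            j += 1
        end
        inds[i] = j
    end
    return inds
end

# asof_backward_indices([1, 5, 10], [1, 2, 3, 6, 7])  # -> [1, 3, 5]
```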
Adding this feature would cover all kinds of joins on arbitrary conditions, with all kinds of "as of" and similar joins being a fast special case.
What I am not sure about is whether this is overly general (and thus not needed in practice). Maybe it is enough to support such matching only on a single column and require the other on columns to be standard equijoin conditions?
CC @nalimilan
@jariji - this is another important PR where API design is a key challenge.
I'll just mention a related thing for now. Sometimes I want fuzzy joins, like on=:birthdate, nearest=:name => levenshtein or on=:birthdate, filter=:name => <(4)∘levenshtein (which reduces memory usage relative to a two-step process).
I was thinking of "nearest" join, but could not decide what to do with it (current proposal does not cover nearest join).
If this is indeed needed, we could change condition from returning a Bool (is the condition met) to returning a distance, and then add a threshold option (or something like it), where only pairs whose distance is less than the threshold are considered matching; one of the options for threshold could be e.g. min (to select only the nearest match).
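That distance-plus-threshold idea can be sketched on plain vectors (the names dist, threshold, and nearest are illustrative, not a proposed API):

```julia
# For each left key, return the indices of right keys whose distance is at
# most `threshold`; with nearest=true, keep only the single closest match.
function distance_matches(left_keys, right_keys, dist; threshold, nearest::Bool=false)
    map(left_keys) do t
        ds = [dist(t, r) for r in right_keys]
        if nearest
            d, i = findmin(ds)
            d <= threshold ? [i] : Int[]
        else
            findall(<=(threshold), ds)
        end
    end
end

# e.g. distance_matches([5], [1,2,3,6,7], (a,b) -> abs(a - b);
#                       threshold=2, nearest=true)  # -> [[4]]
```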
Hi all, I’ll share my two cents as a recent convert from R tidyverse to DataFrames.jl.
My two favorite features from the new join_by feature in R tidyverse are inequality joins, and rolling joins (see docs here: https://dplyr.tidyverse.org/reference/join_by.html).
The main difference between these is that inequality joins select all rows meeting a set of criteria, whereas rolling joins select the closest row. In the tidyverse implementation, you can choose to use the closest matches for all or only some of the criteria.
So in R the following code will select all rows where the id columns match and the sale_date in the sales data frame is >= promo_date in the promos data frame.
left_join(sales, promos, join_by(id == id, sale_date >= promo_date))
For rolling joins, this will select the closest match only. Note that you can separate multiple criteria with commas, and you can apply closest to only certain criteria.
left_join(sales, promos, join_by(id == id, closest(sale_date >= promo_date)))
What I would love to see in DataFrames.jl for inequality joins would be something like this:
leftjoin(sales, promos, ((df1, df2) -> (df1.id .== df2.id) .& (df1.sale_date .>= df2.promo_date)))
For rolling joins, it would be nice to do this using some helper function, but even a keyword argument could work, like this:
leftjoin(sales, promos, ((df1, df2) -> (df1.id .== df2.id) .& (df1.sale_date .>= df2.promo_date)); match = :closest)
I totally get that this design probably violates the function barrier, which could lead to performance problems, so another way to write it, taking the function barrier into account, might be something like this (where each initial tuple inside the vector maps the respective columns to symbols from the respective data frames):
leftjoin(sales, promos, [(:id => :id1, :sale_date), (:id => :id2, :promo_date)] =>
((id1, sale_date, id2, promo_date) -> (id1 .== id2) .& (sale_date .>= promo_date)))
Also, you potentially wouldn’t even need the match = :closest keyword argument because you could build that into the function itself, like this:
leftjoin(sales, promos, [(:id => :id1, :sale_date), (:id => :id2, :promo_date)] =>
((id1, sale_date, id2, promo_date) -> (id1 .== id2) .& (sale_date .>= promo_date) .&
(sale_date .- promo_date .== minimum(sale_date .- promo_date))))
I haven’t looked through all of the alternative proposals above, so mea culpa if there is overlap. What I like about this is that you could specify all kinds of things within the function so it would be quite flexible. It is verbose for sure. Also, I’m new at this so feel free to take that into consideration as to whether this is good/bad design.
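For reference, the semantics of a predicate-based left join like the one sketched above can be written out naively in a few lines (an illustrative O(n*m) sketch with made-up sales/promos data, not an implementation proposal):

```julia
using DataFrames

# Illustrative O(n*m) sketch: a left join on an arbitrary row predicate,
# keeping unmatched left rows with missing right values.
function predicate_leftjoin(df1, df2, pred)
    left_inds = Int[]
    right_inds = Union{Missing,Int}[]
    for i in 1:nrow(df1)
        found = false
        for j in 1:nrow(df2)
            if pred(df1[i, :], df2[j, :])
                push!(left_inds, i)
                push!(right_inds, j)
                found = true
            end
        end
        if !found  # keep the left row with no right match
            push!(left_inds, i)
            push!(right_inds, missing)
        end
    end
    # materialize right rows, blanking out the placeholder rows
    right_part = allowmissing(df2[coalesce.(right_inds, 1), :])
    for (k, j) in enumerate(right_inds)
        if ismissing(j)
            for col in names(right_part)
                right_part[k, col] = missing
            end
        end
    end
    hcat(df1[left_inds, :], right_part, makeunique=true)
end

# hypothetical data echoing the sales/promos example above
sales = DataFrame(id=[1, 1, 2, 3], sale_date=[3, 5, 4, 1])
promos = DataFrame(id=[1, 2, 2], promo_date=[2, 1, 9])
joined = predicate_leftjoin(sales, promos,
    (l, r) -> l.id == r.id && l.sale_date >= r.promo_date)
```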
I've used this package for various time series joins: https://gitlab.com/aplavin/FlexiJoins.jl
The syntax takes a little getting used to, but it's pretty powerful and performance is decent, whether joining DataFrames or other types.
I've had issues with getting the multi keyword argument to work with DataFrames using FlexiJoins.jl. Also, I don't like the behavior that results in duplicated column names for the keys being joined on. But yes, I do think the syntax is totally reasonable if those issues were cleaned up or clarified.
@kdpsingh
Here's how I use the multi keyword to do an asof join on DataFrames:
julia> t1 = DataFrame(t=rand(10), a=rand(10)); t2 = DataFrame(t=rand(100), b=rand(100));
julia> leftjoin((t1, t2), by_pred(:t, >=, :t), multi=(identity,closest))
10×4 DataFrame
Row │ t a t_1 b
│ Float64 Float64 Float64? Float64?
─────┼───────────────────────────────────────────────────────
1 │ 0.026362 0.513261 0.0181962 0.724834
2 │ 0.749699 0.149604 0.725635 0.968237
3 │ 0.195397 0.260313 0.189569 0.820139
4 │ 0.491868 0.891044 0.470237 0.841432
5 │ 0.901302 0.168646 0.872977 0.227124
6 │ 0.35446 0.740321 0.351904 0.385458
7 │ 0.370695 0.555825 0.354942 0.196156
8 │ 0.914361 0.991622 0.907086 0.792908
9 │ 0.287491 0.538382 0.287004 0.0516505
10 │ 0.0106906 0.180112 missing missing
Thanks! That's super helpful. Quick question: what is the purpose of identity in the multi argument?
I think in that case my only critique of the FlexiJoins.jl implementation is that when matching on a name (or on equality via by_pred), there's no need to create both a and a_1. For inequality joins, it makes sense to have both (since they won't always match), but otherwise it creates an extra column that's not needed, especially when the columns have the same name to begin with.
Otherwise, I would be happy if this were incorporated as core functionality within DataFrames.jl. It's certainly less verbose than what I proposed.
what is the purpose of identity in the multi argument?
My understanding is that it's just a placeholder, so that it's clear to which table the closest applies. For other data types, it's possible to pass multi a named tuple and then identity isn't needed, but for some reason that I forget, that won't work with DataFrames.
FlexiJoins are a general join implementation, working for various collection types - most commonly, some kind of array. Many collections support the common Julia interface (eg indexing, views), so they work as-is with FlexiJoins without any specific processing.
DataFrames don't follow the same interface, so they are handled by a thin conversion layer: https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L40-57. This layer is considered experimental, with details that can potentially change. I personally don't need dataframes and use simpler table types, so this conversion layer is very basic for now. Please suggest concrete improvements if you have a better understanding of what DataFrames users would expect from joining.
Also, I don't like the behavior that results in duplicated column names for the keys being joined on.
For regular Julia collections as the input, the result is nested - see examples at https://aplavin.github.io/FlexiJoins.jl/test/examples.html. Dataframes are fundamentally flat, so this results in duplicates when no postprocessing is done (as is the case now).
For other data types, it's possible to pass multi a named tuple and then identity isn't needed, but for some reason that I forget, that won't work with DataFrames.
It just wasn't clear what the semantics of passing two dataframes as a named tuple should be - so I didn't implement this. Feel free to suggest any changes to the conversion layer; I don't have strong opinions on how dataframes should behave.
what is the purpose of identity in the multi argument?
multi indicates what to do with duplicate matches. As in regular Julia code, identity means "don't do anything, keep all", last - "take the last match", closest (special value) - take the closest when it makes sense.
Thanks @aplavin! Appreciate the answers. None of what I said was intended as criticism - it's more stuff that DataFrames.jl maintainers should consider as they think about the best way to support flexible joins using DataFrames. I totally recognize that your package works with a much broader set of data structures.
@kdpsingh I just wanted to explain why the situation with dataframes in FlexiJoins is as it is, and to suggest a direct way to improve it right now for anyone with a better understanding of the expectations. Tone is sometimes hard to read in written communication - I didn't take your questions as criticism anyway :)