Add support for JOINing time series via "AS OF" time

Open min-mwei opened this issue 6 years ago • 22 comments

For time series data, a powerful function is asof pioneered in kx/kdb, also being implemented by Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html#whatsnew-0190-enhancements-asof-merge

It would be really cool if TimescaleDB could support it natively instead of having to write 3 tricky queries to join two time series tables.

min-mwei avatar Oct 21 '17 20:10 min-mwei

Definitely an analytical function on our short list. Thanks for the request.

(And if others would similarly like this, please upvote!)

mfreed avatar Oct 21 '17 22:10 mfreed

Hi, is there an update on support for an As Of join? I am looking to test out various time series solutions, and AsOf is something we would really like to support.

@mfreed

ydesai0830 avatar Nov 05 '18 22:11 ydesai0830

Hi,

Any plans for this feature? It is a major point for the financial (tick) data use case, so it could help adoption of Timescale in this niche. Another reference, from the leading database used in this market: https://code.kx.com/wiki/Reference/aj

Thanks!

oldrichsmejkal avatar Jan 14 '19 15:01 oldrichsmejkal

I agree. Having a performant global "AS OF" on a traditional RDBMS like Postgres would be a holy grail. Maintaining history today with Postgres is, hmm, painful. This alone has made me seriously consider using Datomic (https://docs.datomic.com/on-prem/clojure/index.html#datomic.api/as-of). A dream query for me would be writing:

SELECT foo.id, bar.baz, sup.sop FROM foo
JOIN bar ON ...
JOIN sup ON ...
WHERE ...
LIMIT By 42
AS OF 2017-11-12

and getting the state of the world from that datetime. I realise this is moving from time series and into another territory which might not be a good fit for timescaledb. But a man can hope :)

runekaagaard avatar Jan 14 '19 18:01 runekaagaard

Any updates or best practices?

franz101 avatar Sep 23 '19 01:09 franz101

This seems to be a possibility? https://dba.stackexchange.com/posts/185372/revisions

franz101 avatar Sep 23 '19 01:09 franz101

This is the best I've managed to get so far:

SELECT * FROM table_a
CROSS JOIN LATERAL (
  SELECT * FROM table_b
  WHERE (
    ...
    AND
    table_a.time > table_b.time
  )
  ORDER BY table_b.time DESC LIMIT 1
) lookup
ORDER BY table_a.time;

Still not very performant though, takes >1s on two tables of ~200k rows (pandas can do it >50x faster).

mridsole avatar Oct 14 '19 05:10 mridsole

Are there any plans to add this? Would be a really big deal for financial applications, among many others

EgorKraevTransferwise avatar Jul 17 '20 15:07 EgorKraevTransferwise

+1

Any update here? Is there an ETA for as-of join support?

kzk2000 avatar Mar 20 '21 22:03 kzk2000

Very under the (my) radar, mariaDB suddenly has very cool looking support for AS OF time machining:

https://mariadb.com/kb/en/system-versioned-tables/

runekaagaard avatar Mar 21 '21 18:03 runekaagaard

@runekaagaard thanks, so no ETA?

Just to be crystal clear: We are referring to as-of joins and not AS OF for data versioning.

Good example for AS OF JOINS is here: https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html --> see bottom of that page which illustrates how to join market trades & quotes

The kdb+ equivalent command is "aj", see https://code.kx.com/q/ref/aj/

If I misunderstood the purpose of this issue here, let me know and I'm happy to open another one

kzk2000 avatar Mar 21 '21 18:03 kzk2000

The addition of as-of joins within Timescale would be a huge benefit to us. We have lots of high frequency time series datasets from environmental sensors that have slightly differing timestamps (at the ms scale) and occasional gaps. Our requirements are pretty much met by the functionality described in the pandas.merge_asof documentation but it would be great if we could do the same sort of thing at the database level. A possible addition would be to allow for aligning the sensor time series datasets against a strict timeline (e.g. every exact second, minute etc.).
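As a rough illustration of the alignment described above, here is a minimal pure-Python sketch, not a database feature. The function name `asof_align`, the `tolerance` parameter, and the sample readings are all hypothetical. It takes the latest reading at or before each grid timestamp and treats readings older than the tolerance as gaps.

```python
from bisect import bisect_right

def asof_align(grid, times, values, tolerance=None):
    """For each grid timestamp, return the latest reading at or before it.

    `times` must be sorted ascending. If `tolerance` is given, a reading
    older than `tolerance` relative to the grid timestamp counts as a gap.
    """
    out = []
    for ts in grid:
        i = bisect_right(times, ts) - 1  # index of last reading with time <= ts
        if i < 0 or (tolerance is not None and ts - times[i] > tolerance):
            out.append(None)
        else:
            out.append(values[i])
    return out

# Hypothetical sensor readings: timestamps in milliseconds, slightly off
# the exact second, with a gap between 1999 ms and 4998 ms.
times = [998, 1999, 4998]
values = [20.1, 20.4, 21.0]
grid = [1000, 2000, 3000, 4000, 5000]

aligned = asof_align(grid, times, values, tolerance=500)
print(aligned)  # [20.1, 20.4, None, None, 21.0]
```

Without a tolerance, the last reading is carried forward across the gap, which may or may not be what a given analysis wants.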

alex-tate avatar Mar 29 '21 15:03 alex-tate

The simple way to do this is to write the following sort of join:

CREATE TABLE foo ( time timestamptz, id int, val double precision);

CREATE INDEX on foo(id, time DESC);

SELECT t1.time, t1.val as t1_val, t2.val as t2_val
FROM foo AS t1,
LATERAL (
  SELECT val
  FROM foo t2
  WHERE t2.id = 2 AND t2.time <= t1.time
  ORDER BY t2.time DESC
  LIMIT 1
) t2
WHERE t1.id = 1
ORDER BY t1.time;

You will definitely want the index on id, time DESC there, as that will make it much more efficient.

In cases where the tables are separate it will be slightly different, simpler in some ways as it involves less aliasing:


CREATE TABLE foo1 ( time timestamptz, val double precision);
CREATE TABLE foo2 ( time timestamptz, val double precision);

CREATE INDEX ON foo1 (time DESC);
CREATE INDEX ON foo2 (time DESC) INCLUDE (val);

SELECT foo1.time, foo1.val as foo1_val, foo2.val as foo2_val
FROM foo1,
LATERAL (
  SELECT foo2.val
  FROM foo2
  WHERE foo2.time <= foo1.time
  ORDER BY foo2.time DESC
  LIMIT 1
) foo2
ORDER BY foo1.time;

This may not always be the most efficient, but it should work reasonably well in smallish cases. We'll also work on some ways of doing this with the timeseries API as discussed: https://github.com/timescale/timescale-analytics/issues/162 that may be more efficient in some cases.

davidkohn88 avatar Jun 21 '21 17:06 davidkohn88

note: the LATERAL query above benefits significantly from being able to perform an index-only scan on foo2, you would want an index like

CREATE INDEX ON foo2 (time DESC) INCLUDE (val);

JLockerman avatar Jun 21 '21 18:06 JLockerman

Thank you @davidkohn88 this is a fantastic solution suggestion. 👍

NunoFilipeSantos avatar Jun 22 '21 11:06 NunoFilipeSantos

I am also very interested in this "as_of" join, what would be the best solution for now?

enthusiastics avatar Sep 10 '21 00:09 enthusiastics

The joins above (https://github.com/timescale/timescaledb/issues/271#issuecomment-865231568) are reasonable solutions for now and should be reasonably performant, depending on what you're doing. We're also thinking about adding some more functionality around this in the toolkit, but that is probably a ways off. For now, you can add comments in this issue, and maybe explain in more detail what exactly you're trying to achieve and what you think we should prioritize: https://github.com/timescale/timescaledb-toolkit/issues/162

davidkohn88 avatar Sep 10 '21 16:09 davidkohn88

With LATERAL join on timestamp inequality, I guess the right side is going to do index lookups, but it's still O(left_rows * log(right_rows)). A merge-join-like algorithm will be O(left_rows + right_rows).
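To make the complexity difference concrete, here is a minimal pure-Python sketch, not PostgreSQL executor code. The function names and sample data are hypothetical, both inputs are assumed to be sorted by time already, and per-series keys are ignored. `asof_lookup` mirrors the per-row index lookup, while `asof_merge` advances a single cursor over the right side.

```python
from bisect import bisect_right

def asof_lookup(left, right):
    """Per-row binary search: O(len(left) * log(len(right)))."""
    right_times = [t for t, _ in right]
    out = []
    for t, lv in left:
        i = bisect_right(right_times, t) - 1  # last right row with time <= t
        out.append((t, lv, right[i][1] if i >= 0 else None))
    return out

def asof_merge(left, right):
    """Single forward pass over both inputs: O(len(left) + len(right))."""
    out = []
    j = 0  # count of right rows with time <= current left time
    for t, lv in left:
        while j < len(right) and right[j][0] <= t:
            j += 1  # the cursor only moves forward, never resets
        out.append((t, lv, right[j - 1][1] if j > 0 else None))
    return out

# Hypothetical (time, payload) rows, both sorted by time.
trades = [(1, "t1"), (5, "t2"), (5, "t3"), (9, "t4")]
quotes = [(2, 10.0), (4, 10.5), (5, 11.0), (8, 10.8)]

print(asof_merge(trades, quotes))
# [(1, 't1', None), (5, 't2', 11.0), (5, 't3', 11.0), (9, 't4', 10.8)]
```

Both functions return the same rows; they differ only in how often the right side is searched.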

An older attempt at implementing this in vanilla Postgres: https://www.postgresql.org/message-id/flat/bc494762-26bd-b100-e1f9-a97901ddad57%40postgrespro.ru

For reference, ClickHouse uses special grammar for this: https://clickhouse.com/docs/en/sql-reference/statements/select/join/#asof-join-usage

akuzm avatar Sep 22 '21 09:09 akuzm

The different grammar is probably required because the semantics is different from the normal join on inequality condition -- we only have to return the closest righthand row that matches. In an extension, we can't introduce new grammar (or can we?), so we can consider using a special dummy function for the join, e.g. JOIN ON left.series = right.series AND timescale.asof(left.timestamp <= right.ts).
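The semantic difference can be shown with a small pure-Python sketch. The function names and data are hypothetical, and it assumes the "latest right row at or before the left row" direction used by the LATERAL queries earlier in the thread. A plain inequality join returns every qualifying pair, while an as-of join keeps only the closest one per left row.

```python
def inequality_join(left, right):
    """Plain join on an inequality: every right time <= left time matches."""
    return [(lt, rt) for lt in left for rt in right if rt <= lt]

def asof_join(left, right):
    """As-of join: only the closest right time at or before each left time."""
    out = []
    for lt in left:
        candidates = [rt for rt in right if rt <= lt]
        out.append((lt, max(candidates) if candidates else None))
    return out

left_times = [3, 6]
right_times = [1, 2, 5]

print(inequality_join(left_times, right_times))
# [(3, 1), (3, 2), (6, 1), (6, 2), (6, 5)]
print(asof_join(left_times, right_times))
# [(3, 2), (6, 5)]
```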

I had an old patch that extended the merge join executor to support full join on inequality: https://www.postgresql.org/message-id/flat/[email protected] I think I could simplify and reuse it for ASOF joins.

akuzm avatar Sep 27 '21 12:09 akuzm

Wrote a design memo for internal use: https://docs.google.com/document/d/1YEX038V-gq-iLfM-KpLV0zpNn3yW88I9qPVa0_bs4aQ/edit

akuzm avatar Oct 19 '21 11:10 akuzm

The discussion upstream: https://www.postgresql.org/message-id/flat/CALzhyqwuVz0FJZ-oCYQ9d%2ByrPrbF5a9HDyAjxuSUdgq8n7nshQ%40mail.gmail.com

akuzm avatar Dec 14 '21 14:12 akuzm

Is there an agreed way forward on this issue even if addressing it isn't imminent? A related timescaledb-toolkit issue was closed last year pending work on 'multi-value timeseries' but it is not clear what this functionality refers to. The latest posts in this issue suggest the asof-join functionality could be addressed within Postgres itself but it is difficult to know from the discussion whether this is likely to happen in the near-future.

alex-tate avatar Jul 24 '22 20:07 alex-tate

We're planning to try to prototype this inside the TimescaleDB extension in Q4 of 2022, using the no. 3 merge-hash algorithm I posted upstream. That's just a research prototype, so we can't promise when or if we will release something that is actually usable.

akuzm avatar Oct 10 '22 12:10 akuzm

We have found the “as-of” join to be essential in our analyses involving high frequency financial data (market data, trade data etc.). The approach we present in the link below seems to be very fast and execution time appears to scale linearly with row count (n), unlike lateral join type approaches that typically scale as O(n^2). The SQL query we link to could perhaps be further optimized but we think that this is a promising approach in general.

https://gist.github.com/RMB-eQuant/758539f8914f2dd4461ec0ce144b048b

The table below compares the execution time of our approach (called “UNION ALL ALGO”) to the lateral join approach presented in the post https://github.com/timescale/timescaledb/issues/271#issuecomment-865231568 (“LATERAL JOIN”). Execution time is in seconds and the benchmarks were run on a Timescale-pro-100-16gb-2cpu-compute-optimized (2 CPU, 16 GB RAM, 100 GB storage) instance. In the table below, NaN values correspond to runs that were too slow to complete.

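[image: benchmark table comparing execution times, in seconds, of "UNION ALL ALGO" and "LATERAL JOIN"; values not recoverable]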
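The linked gist is not reproduced here. The following pure-Python sketch is only a guess at the general shape the name "UNION ALL ALGO" suggests: combine both inputs into one stream, order it by time, and carry the last right-hand value forward onto each left-hand row. The function name and sample data are hypothetical, and this should not be read as the gist's actual query.

```python
def asof_union(left, right):
    """Tag rows from both inputs, sort them together, carry the last right value forward."""
    # Tag 0 = right row, tag 1 = left row, so a right row sharing a timestamp
    # with a left row sorts first and is visible to it (a `<=` comparison).
    tagged = [(t, 0, v) for t, v in right] + [(t, 1, v) for t, v in left]
    tagged.sort(key=lambda row: (row[0], row[1]))

    out = []
    last_right = None  # most recent right-hand value seen so far
    for t, tag, v in tagged:
        if tag == 0:
            last_right = v
        else:
            out.append((t, v, last_right))
    return out

# Hypothetical (time, payload) rows.
trades = [(1, "t1"), (5, "t2"), (5, "t3"), (9, "t4")]
quotes = [(2, 10.0), (4, 10.5), (5, 11.0), (8, 10.8)]

print(asof_union(trades, quotes))
# [(1, 't1', None), (5, 't2', 11.0), (5, 't3', 11.0), (9, 't4', 10.8)]
```

After the sort, the work is a single pass over the combined rows. The sort itself costs O(n log n) in this sketch.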

RMB-eQuant avatar Nov 23 '22 09:11 RMB-eQuant