tempo
Enhancement request: IntervalDF.disjoint apply on more than 2 metrics
I have been playing around with the IntervalsDF.disjoint function, but I found what looks like a bug when I need to decouple intervals over more than 2 metrics.
Real code here:
# assuming IntervalsDF is imported from tempo's intervals module
from tempo.intervals import IntervalsDF

# this dataframe has overlapping time intervals from 00:00:14 to 00:00:16 for all the metrics
df = spark.createDataFrame(
[["2020-08-01 00:00:10", "2020-08-01 00:00:16", "s1", 1, "id123"],
["2020-08-01 00:00:09", "2020-08-01 00:00:11", "s1", 1, "id123"],
["2020-08-01 00:00:14", "2020-08-01 00:00:17", "s2", 1, "id123"],
["2020-08-01 00:00:14", "2020-08-01 00:00:21", "s3", 1, "id123"],
["2020-08-01 00:00:14", "2020-08-01 00:00:16", "s4", 1, "id123"],
],
"start_timestamp STRING, end_timestamp STRING, signal_name STRING, signal_value INT, id STRING",
)
# real result using tempo library
idf = IntervalsDF.fromStackedMetrics(df, "start_timestamp", "end_timestamp", ["id"], "signal_name", "signal_value", ["s1", "s2", "s3", "s4"])
result_df = idf.disjoint().toDF().sort("start_timestamp")
result_df.show()
# +-------------------+-------------------+-----+----+----+----+----+
# | start_timestamp| end_timestamp| id| s1| s2| s3| s4|
# +-------------------+-------------------+-----+----+----+----+----+
# |2020-08-01 00:00:09|2020-08-01 00:00:10|id123| 1|null|null|null|
# |2020-08-01 00:00:10|2020-08-01 00:00:14|id123| 1|null|null|null|
# |2020-08-01 00:00:10|2020-08-01 00:00:11|id123| 1|null|null|null|
# |2020-08-01 00:00:14|2020-08-01 00:00:16|id123| 1| 1|null| 1|
# |2020-08-01 00:00:14|2020-08-01 00:00:17|id123|null| 1| 1|null|
# |2020-08-01 00:00:16|2020-08-01 00:00:17|id123|null| 1|null|null|
# |2020-08-01 00:00:17|2020-08-01 00:00:21|id123|null|null| 1|null|
# +-------------------+-------------------+-----+----+----+----+----+
# The ideal result after applying fromStackedMetrics() and disjoint() should be
#+-------------------+-------------------+----+----+----+----+
#| start_timestamp| end_timestamp| s1| s2| s3| s4|
#+-------------------+-------------------+----+----+----+----+
#|2020-08-01 00:00:09|2020-08-01 00:00:14| 1|null|null|null|
#|2020-08-01 00:00:14|2020-08-01 00:00:16| 1| 1| 1| 1|
#|2020-08-01 00:00:16|2020-08-01 00:00:17|null| 1| 1|null|
#|2020-08-01 00:00:17|2020-08-01 00:00:21|null|null| 1|null|
#+-------------------+-------------------+----+----+----+----+
I found the result is correct when only 2 metrics are provided. With more than 2 metrics, the result is not optimized and can be broken apart into several rows. As you can see from the case above, the three rows that carry s1 across the span from 00:00:09 to 00:00:14 could be merged together. That part is expected, though, since I don't think there is any group-by-related logic in the source code. The bigger problem is the interval from 00:00:14 to 00:00:16: all 4 metrics share exactly that interval, but the result doesn't identify the overlap.
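To show the kind of group-by/merge step I mean, here is a rough sketch in plain PySpark (not a tempo function). It coalesces consecutive rows whose intervals touch and whose metric values are all identical, and it assumes the input rows are already disjoint, so it only illustrates the merging part, not the overlap bug itself:

from pyspark.sql import functions as F, Window

# Sketch only: merge consecutive rows with touching intervals and equal metric values.
metrics = ["s1", "s2", "s3", "s4"]
w = Window.partitionBy("id").orderBy("start_timestamp")

# True only when every metric matches the previous row (null-safe comparison).
same_values = F.lit(True)
for m in metrics:
    same_values = same_values & F.col(m).eqNullSafe(F.lag(m).over(w))

merged = (
    result_df
    # a row continues the previous "island" if the intervals touch and the values match
    .withColumn(
        "new_island",
        F.when(
            (F.lag("end_timestamp").over(w) == F.col("start_timestamp")) & same_values,
            F.lit(0),
        ).otherwise(F.lit(1)),
    )
    .withColumn("island", F.sum("new_island").over(w))
    .groupBy("id", "island", *metrics)
    .agg(
        F.min("start_timestamp").alias("start_timestamp"),
        F.max("end_timestamp").alias("end_timestamp"),
    )
    .drop("island")
)
merged.sort("start_timestamp").show()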
@R7L208 I think you are the best guy to turn to for this issue.
Thanks @EvanstsaiTW! I will find some time this week to start looking into this
@EvanstsaiTW - it looks like more than two adjacent rows sharing the same start or end time is an edge case that's poorly handled right now. The three records starting at 2020-08-01 00:00:14 are what's causing the issue.
idf = IntervalsDF.fromStackedMetrics(
df,
"start_timestamp",
"end_timestamp",
["id"],
"signal_name",
"signal_value",
["s1", "s2", "s3", "s4"],
)
idf.df.orderBy("start_timestamp", "end_timestamp").show()
# +-------------------+-------------------+-----+----+----+----+----+
# | start_timestamp| end_timestamp| id| s1| s2| s3| s4|
# +-------------------+-------------------+-----+----+----+----+----+
# |2020-08-01 00:00:09|2020-08-01 00:00:11|id123| 1|null|null|null|
# |2020-08-01 00:00:10|2020-08-01 00:00:16|id123| 1|null|null|null|
# |2020-08-01 00:00:14|2020-08-01 00:00:16|id123|null|null|null| 1|
# |2020-08-01 00:00:14|2020-08-01 00:00:17|id123|null| 1|null|null|
# |2020-08-01 00:00:14|2020-08-01 00:00:21|id123|null|null| 1|null|
# +-------------------+-------------------+-----+----+----+----+----+
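For example, the offending duplicate start times can be surfaced directly from the pivoted DataFrame:

# Boundaries shared by more than one interval per id; the three rows starting
# at 2020-08-01 00:00:14 are the ones that trip up disjoint().
(
    idf.df.groupBy("id", "start_timestamp")
    .count()
    .filter("count > 1")
    .show()
)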
I'm thinking the best way to solve this is to check that start_timestamp and end_timestamp do not have duplicates per id before exiting disjoint. If duplicates are found, call disjoint again on the DataFrame until all intervals that share the same start_timestamp or end_timestamp are resolved.
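Roughly, the loop I have in mind would look something like this; it's just a sketch against the example's columns, and the duplicate check is a hypothetical helper, not something tempo exposes today:

def has_duplicate_boundaries(df) -> bool:
    # Hypothetical helper: True if any id still has more than one interval
    # sharing the same start_timestamp or the same end_timestamp.
    dup_starts = df.groupBy("id", "start_timestamp").count().filter("count > 1")
    dup_ends = df.groupBy("id", "end_timestamp").count().filter("count > 1")
    return dup_starts.limit(1).count() > 0 or dup_ends.limit(1).count() > 0

# Re-apply disjoint() until no interval shares a boundary with another,
# with an iteration cap as a safety net in case it never converges.
result = idf.disjoint()
for _ in range(10):
    if not has_duplicate_boundaries(result.df):
        break
    result = result.disjoint()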
Using groupBy, we could easily construct the correct interval of 2020-08-01 00:00:14 to 2020-08-01 00:00:16, but then constructing the subsequent intervals of 2020-08-01 00:00:16 to 2020-08-01 00:00:17 and 2020-08-01 00:00:17 to 2020-08-01 00:00:21 becomes more difficult.
Wanted to drop another update: after iterating on a few ideas, we plan to re-implement this using Pandas UDFs to better handle edge cases. I'll start working on this in the coming weeks.
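To make that concrete, here is a rough sketch of the shape I'm picturing; this is not the eventual tempo implementation, and the column names are just lifted from the example above:

import pandas as pd

METRICS = ["s1", "s2", "s3", "s4"]

def make_disjoint(pdf: pd.DataFrame) -> pd.DataFrame:
    # Split one id's intervals at every boundary point, then fill each metric
    # from any source interval that covers the resulting sub-interval.
    boundaries = sorted(set(pdf["start_timestamp"]) | set(pdf["end_timestamp"]))
    rows = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        covering = pdf[(pdf["start_timestamp"] <= lo) & (pdf["end_timestamp"] >= hi)]
        if covering.empty:
            continue  # gap between this id's intervals
        row = {"id": pdf["id"].iloc[0], "start_timestamp": lo, "end_timestamp": hi}
        for m in METRICS:
            non_null = covering[m].dropna()
            row[m] = non_null.iloc[0] if len(non_null) else None
        rows.append(row)
    return pd.DataFrame(rows, columns=list(pdf.columns))

# Reuse the input schema so the output column types line up with the input.
disjoint_df = idf.df.groupBy("id").applyInPandas(make_disjoint, schema=idf.df.schema)
disjoint_df.orderBy("start_timestamp").show()

Note this only splits intervals at boundary points; merging adjacent rows that carry identical values, as in the ideal output above, would still be a separate pass.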