
Enhancement request: IntervalsDF.disjoint applied to more than 2 metrics


I am playing around with the IntervalsDF.disjoint function, but I found a bug when I need to compute disjoint intervals over more than 2 metrics.

Real code here:

from tempo.intervals import IntervalsDF

# this DataFrame deliberately has overlapping time intervals from 00:00:14 to 00:00:16 across all the metrics
# (assumes an active SparkSession bound to the name `spark`)
df = spark.createDataFrame(
    [["2020-08-01 00:00:10", "2020-08-01 00:00:16", "s1", 1, "id123"],
     ["2020-08-01 00:00:09", "2020-08-01 00:00:11", "s1", 1, "id123"],
     ["2020-08-01 00:00:14", "2020-08-01 00:00:17", "s2", 1, "id123"],
     ["2020-08-01 00:00:14", "2020-08-01 00:00:21", "s3", 1, "id123"],
     ["2020-08-01 00:00:14", "2020-08-01 00:00:16", "s4", 1, "id123"],
],
    "start_timestamp STRING, end_timestamp STRING, signal_name STRING, signal_value INT, id STRING",
)

# real result using tempo library
idf = IntervalsDF.fromStackedMetrics(df, "start_timestamp", "end_timestamp", ["id"], "signal_name", "signal_value", ["s1", "s2", "s3", "s4"])
result_df = idf.disjoint().toDF().sort("start_timestamp")
result_df.show()

# +-------------------+-------------------+-----+----+----+----+----+
# |    start_timestamp|      end_timestamp|   id|  s1|  s2|  s3|  s4|
# +-------------------+-------------------+-----+----+----+----+----+
# |2020-08-01 00:00:09|2020-08-01 00:00:10|id123|   1|null|null|null|
# |2020-08-01 00:00:10|2020-08-01 00:00:14|id123|   1|null|null|null|
# |2020-08-01 00:00:10|2020-08-01 00:00:11|id123|   1|null|null|null|
# |2020-08-01 00:00:14|2020-08-01 00:00:16|id123|   1|   1|null|   1|
# |2020-08-01 00:00:14|2020-08-01 00:00:17|id123|null|   1|   1|null|
# |2020-08-01 00:00:16|2020-08-01 00:00:17|id123|null|   1|null|null|
# |2020-08-01 00:00:17|2020-08-01 00:00:21|id123|null|null|   1|null|
# +-------------------+-------------------+-----+----+----+----+----+



# The ideal result after applying fromStackedMetrics() and disjoint() should be
#+-------------------+-------------------+----+----+----+----+
#|    start_timestamp|      end_timestamp|  s1|  s2|  s3|  s4|
#+-------------------+-------------------+----+----+----+----+
#|2020-08-01 00:00:09|2020-08-01 00:00:14|   1|null|null|null|
#|2020-08-01 00:00:14|2020-08-01 00:00:16|   1|   1|   1|   1|
#|2020-08-01 00:00:16|2020-08-01 00:00:17|null|   1|   1|null|
#|2020-08-01 00:00:17|2020-08-01 00:00:21|null|null|   1|null|
#+-------------------+-------------------+----+----+----+----+

I found the result is correct when only 2 metrics are provided. With more than 2 metrics, the result is not fully merged: it can be broken apart into several rows. As you can see from the case above, the three rows where s1 spans the timestamps from 00:00:09 to 00:00:14 could be merged together. That part is expected, though, since I think there's no group-by-related function in the source code.
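
For what it's worth, a rough sketch of what such a merge step could look like as post-processing on the disjoint output (my own illustration, not tempo code; it reuses result_df and the metric columns from the repro above, and relies on the string timestamps comparing correctly in lexicographic order):

from pyspark.sql import Window, functions as F

metrics = ["s1", "s2", "s3", "s4"]
w = Window.partitionBy("id", *metrics).orderBy("start_timestamp", "end_timestamp")

merged = (
    result_df
    # a new "island" starts whenever a row neither touches nor overlaps the
    # previous row carrying the same metric values
    .withColumn("prev_end", F.lag("end_timestamp").over(w))
    .withColumn(
        "new_island",
        F.when(
            F.col("prev_end").isNull()
            | (F.col("prev_end") < F.col("start_timestamp")),
            1,
        ).otherwise(0),
    )
    # a running sum turns the flags into island ids, so touching/overlapping
    # rows with identical metrics share an id and collapse in the groupBy
    .withColumn("island", F.sum("new_island").over(w))
    .groupBy("id", "island", *metrics)
    .agg(
        F.min("start_timestamp").alias("start_timestamp"),
        F.max("end_timestamp").alias("end_timestamp"),
    )
    .select("start_timestamp", "end_timestamp", "id", *metrics)
)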

Regarding the interval from 00:00:14 to 00:00:16, all 4 metrics overlap in this timespan, but the result doesn't identify the overlap.

@R7L208 I think you are the best guy to turn to for this issue.

yurut commented Oct 22 '22 01:10

Thanks @EvanstsaiTW! I will find some time this week to start looking into this.

R7L208 commented Oct 24 '22 13:10

@EvanstsaiTW - looks like more than two adjacent rows with the same start or end time is an edge case that's poorly handled right now. The three records starting with 2020-08-01 00:00:14 are what's causing the issue.

idf = IntervalsDF.fromStackedMetrics(
    df,
    "start_timestamp",
    "end_timestamp",
    ["id"],
    "signal_name",
    "signal_value",
    ["s1", "s2", "s3", "s4"],
)

idf.df.orderBy("start_timestamp", "end_timestamp").show()

# +-------------------+-------------------+-----+----+----+----+----+
# |    start_timestamp|      end_timestamp|   id|  s1|  s2|  s3|  s4|
# +-------------------+-------------------+-----+----+----+----+----+
# |2020-08-01 00:00:09|2020-08-01 00:00:11|id123|   1|null|null|null|
# |2020-08-01 00:00:10|2020-08-01 00:00:16|id123|   1|null|null|null|
# |2020-08-01 00:00:14|2020-08-01 00:00:16|id123|null|null|null|   1|
# |2020-08-01 00:00:14|2020-08-01 00:00:17|id123|null|   1|null|null|
# |2020-08-01 00:00:14|2020-08-01 00:00:21|id123|null|null|   1|null|
# +-------------------+-------------------+-----+----+----+----+----+

I'm thinking the best way to solve this is to check that start_timestamp and end_timestamp do not have duplicates per id before exiting disjoint. If duplicates are found, call disjoint again on the DataFrame until we resolve all intervals that share the same start_timestamp or end_timestamp.
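
Roughly, that loop might look like this (disjoint_until_stable is a hypothetical helper, not an existing tempo API; the duplicate check mirrors the condition described above):

def disjoint_until_stable(idf, max_iterations=10):
    # re-apply disjoint() until no two intervals for the same id share a
    # start_timestamp or an end_timestamp, bounded to avoid infinite loops
    for _ in range(max_iterations):
        idf = idf.disjoint()
        start_dupes = (
            idf.df.groupBy("id", "start_timestamp").count().filter("count > 1").count()
        )
        end_dupes = (
            idf.df.groupBy("id", "end_timestamp").count().filter("count > 1").count()
        )
        if start_dupes == 0 and end_dupes == 0:
            break
    return idf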

Using groupBy, we could easily construct the correct interval of 2020-08-01 00:00:14 to 2020-08-01 00:00:16 but then constructing the subsequent intervals of 2020-08-01 00:00:16 to 2020-08-01 00:00:17 and 2020-08-01 00:00:17 to 2020-08-01 00:00:21 becomes more difficult.
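
The easy half of that would look roughly like this (again just an illustration against the repro data, applied to the output of the first disjoint() pass above):

from pyspark.sql import functions as F

# for rows sharing a start_timestamp, the head sub-interval runs to the
# earliest end_timestamp and carries every metric present in the group
head = (
    result_df.groupBy("id", "start_timestamp")
    .agg(
        F.min("end_timestamp").alias("end_timestamp"),
        *[F.max(c).alias(c) for c in ["s1", "s2", "s3", "s4"]],
    )
)

# reconstructing the leftover tails (00:00:16 to 00:00:17 for s2 and
# 00:00:17 to 00:00:21 for s3) is the part that gets messy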

R7L208 commented Oct 26 '22 14:10

Wanted to drop another update: after iterating on a few ideas, we plan to re-implement this using Pandas UDFs to better handle edge cases. I'll start working on this in the coming weeks.
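
Purely as a sketch of the shape such a re-implementation might take (not the actual design: a boundary sweep inside groupBy(...).applyInPandas, assuming the schema from the repro above):

import pandas as pd

metrics = ["s1", "s2", "s3", "s4"]

def make_disjoint(pdf: pd.DataFrame) -> pd.DataFrame:
    # sweep every distinct boundary and emit one row per elementary
    # sub-interval, carrying the metrics of all intervals that cover it
    bounds = sorted(set(pdf["start_timestamp"]) | set(pdf["end_timestamp"]))
    rows = []
    for lo, hi in zip(bounds, bounds[1:]):
        covering = pdf[(pdf["start_timestamp"] <= lo) & (pdf["end_timestamp"] >= hi)]
        if covering.empty:
            continue
        row = {"start_timestamp": lo, "end_timestamp": hi, "id": pdf["id"].iloc[0]}
        for m in metrics:
            vals = covering[m].dropna()
            row[m] = vals.iloc[0] if len(vals) else None
        rows.append(row)
    out = pd.DataFrame(rows)
    for m in metrics:
        out[m] = out[m].astype("Int64")  # nullable ints survive the Arrow conversion
    return out

disjoint_df = idf.df.groupBy("id").applyInPandas(make_disjoint, schema=idf.df.schema)

A real implementation would still want to merge adjacent sub-intervals with identical values afterwards (as in the earlier merge sketch) to produce the fully compacted ideal output.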

R7L208 commented Nov 29 '22 18:11