aw-server-rust
aw-server-rust copied to clipboard
aw-transform: Add union_events_split
One step in fixing the "Merge web watcher events into window events in the query? (to allow for classifying by url/domain)" https://github.com/ActivityWatch/aw-webui/issues/151. We already have it in the same query, but the data is not merged. Hopefully this will be a good transform to merge window and browser data in that manner.
Codecov Report
Merging #179 (77eef3d) into master (7d55fca) will increase coverage by
13.67%
. The diff coverage is5.06%
.
@@ Coverage Diff @@
## master #179 +/- ##
===========================================
+ Coverage 46.30% 59.97% +13.67%
===========================================
Files 51 44 -7
Lines 6148 4765 -1383
Branches 1454 0 -1454
===========================================
+ Hits 2847 2858 +11
+ Misses 2442 1907 -535
+ Partials 859 0 -859
Impacted Files | Coverage Δ | |
---|---|---|
aw-transform/src/union.rs | 0.00% <0.00%> (ø) |
|
aw-query/src/functions.rs | 82.92% <28.57%> (+16.05%) |
:arrow_up: |
aw-server/src/dirs.rs | 0.00% <0.00%> (-73.69%) |
:arrow_down: |
aw-server/src/logging.rs | 0.00% <0.00%> (-62.50%) |
:arrow_down: |
aw-transform/src/split_url.rs | 12.50% <0.00%> (-54.17%) |
:arrow_down: |
aw-query/src/ast.rs | 0.00% <0.00%> (-37.26%) |
:arrow_down: |
aw-models/src/key_value.rs | 15.38% <0.00%> (-34.62%) |
:arrow_down: |
aw-transform/src/flood.rs | 66.66% <0.00%> (-16.56%) |
:arrow_down: |
aw-transform/src/filter_keyvals.rs | 77.27% <0.00%> (-6.37%) |
:arrow_down: |
aw-transform/src/find_bucket.rs | 80.00% <0.00%> (-4.22%) |
:arrow_down: |
... and 45 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 7d55fca...77eef3d. Read the comment docs.
However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).
So like the opposite of filter_period_intersect and then concatenating window and browser events?
@johan-bjareholt I think something like:
window_events = query(...) # query and do AFK filtering
for browser in browsers: # this for-loop would need to be expressed in the function constructing the query
web_events = query("aw-watcher-web-$browser")
window_events_browser = filter_by(window_events, {app: $browserAppname}) # can't remember what transform is used to do this
web_events_active = filter_period_intersect(browser_events, window_events_browser)` # filter web events by browser being the active window
web_events_active = amend_data(web_events_active, {app: $browserAppname}) # optional, amends the missing {app: $browserAppname}
window_events = union(web_events_active, window_events) # picks events from web_events_active first, then fills rest with window_events
Not sure if we have something like union
(in the way I mean it) already, but I mean something that does:
"bucket" | events
---------------------------------------------------------------
web_events_active | [github ] [youtube ]
window |[spotify ][browser ][terminal ][browser ]
---------------------------------------------------------------
result |[spotify ][github ][terminal ][youtube ]
A complete end-to-end example would be:
bucket | events
---------------------------------------------
window | [terminal ][firefox ][terminal ]
browser | [youtube ][google ]
---------------------------------------------
result | [terminal ][google ][terminal ]
web_events_active = amend_data(web_events_active, {app: $browserAppname}) # optional, amends the missing {app: $browserAppname}
I think this would be a strange transform, inserting data into an event which is hardcoded in the query and does not come from a bucket feels like something we should try to avoid. One reason would be because it encourages people to write dynamically generated queries like we do in the web-ui which is something we don't want.
window_events = union(web_events_active, window_events) # picks events from web_events_active first, then fills rest with window_events
To call that "union" is very misleading in my opinion. A union should be a common ground, not exclude anything.
However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).
What this transform does is essentially this, but it also injects the browsers appname. See the example in the code
/// |---------|--------------------|
/// | events1 |[a ][b ] |
/// | events2 | [c ] [d ]|
/// | result |[a ][ac][bc][b ] |
/// |---------|--------------------|
So if we put that into perspective of window events and browser events already filtered with the browser window:
/// |---------|--------------------------|
/// | window |[firefox ][terminal ]|
/// | browser | [google ] |
/// | result |[f ][f+g ][terminal ]|
/// |---------|--------------------------|
Which is pretty much exacly what you just wrote?
To call that "union" is very misleading in my opinion. A union should be a common ground, not exclude anything.
For sure, I was just unsure what to call it.
I think this would be a strange transform, inserting data into an event which is hardcoded in the query and does not come from a bucket feels like something we should try to avoid.
I agree in general, but in this case the data does come from a bucket (and doesn't actually introduce new hardcoding, since we already have our list of $browserAppname
). We're reinserting exactly what we know we lost before (with filter_by
and filter_period_intersect
).
But on second thought I realize it'll still be messy, due to there being multiple possible appnames for each browser.
So if we put that into perspective of window events and browser events already filtered with the browser window:
<example>
I thought it would become:
/// |---------|---------------------------|
/// | window |[firefox ][terminal ]|
/// | browser | [google ] |
/// | result |[f ][f+g ][t+g][terminal]|
/// |---------|---------------------------|
Basically what I'm trying to get rid of splitting events into two.
Edit: Ah nvm, your example would indeed become as you wrote after filtering the browser events.
But it would still lead to things like this, no?:
/// Abbreviations:
/// - ff: Firefox
/// - t1: Tab 1
/// - ff(t1): Firefox window event polled from when `t1` was active
/// - ff(t1)+t2: Firefox event with title from `t1`, but URL from `t2` (after `merge_map`)
///
/// |---------|------------------------------------|
/// | window |[ff(t1) ][ff(t2) ]|
/// | browser |[t1 ][t2 ]|
/// | result |[ff(t1)+t1 ][ff(t1)+t2 ][ff(t2)+t2 ]|
/// |---------|------------------------------------|
So if I'm not mistaken, this would result in 'middle-events' where the title (which is gotten from the window event, if I understood merge_map
correctly) and URLs (from the web event) are misaligned. This will happen anytime two events don't perfectly overlap (which is always), leading to a lot of small misaligned events like this.
Edit 2: I'm not sure, but maybe this could be resolved by using non-flooded window-events for the union_events_split
and flood after?
Maybe that would lead to:
/// |---------|-------------------------------|
/// | window |[ff(t1) ] [ff(t2) ]|
/// | browser |[t1 ][t2 ]|
/// | result |[ff(t1)+t1 ] [ff(t2)+t2 ]|
/// |---------|-------------------------------|
Which after flooding would become the correct:
/// |---------|-------------------------------|
/// | window |[ff(t1) ] [ff(t2) ]|
/// | browser |[t1 ][t2 ]|
/// | result |[ff(t1)+t1 ][ff(t2)+t2 ]|
/// |---------|-------------------------------|
(or something similar, depending on flooding strategy)
This might require a lot of extra memory though, since we'd need both window_events
(not flooded, for filtering the browser events that go into union_events_split
) and window_events_flooded
(flooded, for filtering the browser events that go into filter_period_intersect
).
Edit 3: Regardless, I'd be happy to merge this if there were more comprehensive tests (for example, checking that these 'middle-events' get created as expected).
Unless we can come up with a neat solution to the problem (which I'm no longer sure there is) I think it's better to just merge this in the meantime, and work on perfecting the transforms/queries later (as this'll probably work good enough).
So if I'm not mistaken, this would result in 'middle-events' where the title (which is gotten from the window event, if I understood merge_map correctly) and URLs (from the web event) are misaligned. This will happen anytime two events don't perfectly overlap (which is always), leading to a lot of small misaligned events like this.
That's a very good point which I thought of before but forgot.
Edit 2: I'm not sure, but maybe this could be resolved by using non-flooded window-events for the union_events_split and flood after?
That sounds like a very clever way of solving it, will try that out.
This might require a lot of extra memory though, since we'd need both window_events (not flooded, for filtering the browser events that go into union_events_split) and window_events_flooded (flooded, for filtering the browser events that go into filter_period_intersect).
I took a deep dive into the query code and it seems like you are correct here. Here's an example of how we would have to change our transforms
- events = flood(query_bucket("bucketname"));
+ events_unflooded = query_bucket("bucketname");
+ events_flooded = flood(events_unflooded)
The query language is very inefficient with its assignments, every time we assign something it gets cloned every time it's used afterwards because we do not know if the variable will be used afterwards or not and we need to guarantee that the variable won't change. So previously when we just called flood(query_bucket("bucketname"))
we could just re-use the previous events while when we assign it we need to first keep the value in the variable and then clone it on each reference as well as not release the value of the variable until the whole query is complete. This can be improved in the future by #119.
But I think that the impact of just one more clone would be minimal compared to the whole issue we have today with #119, there are lots of more unnecessary clones than just this.
Here's a query that works with this transform, the web-ui queries.ts code is really messy though so I'm having a hard time adapting it, especially now that "canonicalQuery" should probably include browser events.
{
"timeperiods": [
"2020-11-01T01:00:00Z/2020-11-30T01:00:00Z"
],
"query": [
"window_events = query_bucket(\"aw-watcher-window_johan-laptop2\");",
"browser_events = query_bucket(\"aw-watcher-web-firefox\");",
"browser_events = split_url_events(browser_events);",
"firefox_events = filter_keyvals(window_events, \"app\", [\"firefox\"]);",
"events = union_events_split(firefox_events, browser_events);",
"events = merge_events_by_keys(events, [\"$domain\"]);",
"events = categorize(events, [
[[\"docs.rs\"], {\"type\": \"regex\", \"regex\": \"^docs.rs$\"}],
[[\"reddit\"], {\"type\": \"regex\", \"regex\": \"^reddit.com$\"}],
[[\"crates.io\"], {\"type\": \"regex\", \"regex\": \"^crates.io$\"}]
]);",
"events = sort_by_duration(events);",
"RETURN = events;"
]
}
Hi! Seems that all major problems have been solved in above discussions. What's currently blocking this PR?