pcp
pmseries: fix sampling when start < first sample or series has gaps
If the start timestamp is before the first sample of a series, or the series has gaps, the goal timestamp lags behind the next sample's timestamp, and therefore the logic always selects the next sample.
This commit resets the goal timestamp if it lags behind.
As mentioned in the description, this mainly occurs when the start= parameter is set to a timestamp before the first sample of a series - a common case when analyzing archives with the quay.io/performancecopilot/archive-analysis container. In that case pmproxy sends all samples (because the goal timestamp always lags behind), effectively ignoring the interval= parameter and resulting in much larger response sizes. But it can also occur if there are gaps in a series.
Tests with a (customer) archive show that this PR reduces pmproxy response sizes from 74 MB to 20 MB of JSON for the same time window. That's still very high fwiw; maybe it's time to replace JSON with protobuf at some point, or to optimize the response data model. Or we could simply change the granularity of the charts (by default, Grafana calculates the interval as time range / panel width, i.e. basically one data point per pixel).
It's still not perfect though: with a 20s interval starting 2m before the first sample in the archive, the output contains one 30s gap, but it's better than before. imho the main issue is that samples in the archive are not exactly 10s apart, but sometimes 10s plus a few nanoseconds and sometimes 10s minus a few nanoseconds, which makes interval sampling tricky.
| [...] 20 MB of JSON [...] Still very high fwiw,
@andreasgerstmayr we haven't yet implemented http compression in pmproxy and this feels like it might provide a decent improvement here (json => text => often compresses well). There was some zlib-based code in pmwebd along these lines that wasn't ported across - I vaguely recall some mention of compatibility issues there (mentioned in comments in the code? distant memory, YMMV) - but that feels like a place to start looking to make further inroads here.
Excellent idea. grafana-pcp uses the http module from the Go standard library, which already supports gzip encoding. I've created #1685 to track this feature in pmproxy.
wdyt about the PR? Initially I changed the code here: https://github.com/performancecopilot/pcp/blob/3b5b55aa30fb7c42436111b12322c8d6b5033e33/src/libpcp_web/src/query.c#L624-L627 to use the timestamp of the first sample if it's later than the requested start= parameter. Later I figured we'd have the same problem if there are any gaps in a series, so I updated the use_next_sample function with a heuristic to detect gaps in general.
wdyt about the PR?
Yep, I think that's a neat approach - nice, simple and effective.
I wonder a bit about the jq dependency - should we vendor this and drop our own pmjson (or install 'jq' as 'pmjson' for back-compat, since that's released and used in REST docs etc)? Looks like it has some really nice features that'd be generally useful for QA, as well as REST API color neatness. (does it have any deps..? I see it's not in RHEL, but from a super quick look, it seems to support Mac and Windows).
Yep, it's really useful for debugging, having proper dates vs. decoding the timestamp by hand.
If we want to run this test also on RHEL I see a few options:
(1) drop the dependency from the test (i.e. compare timestamps), and just add a comment "use this jq snippet when debugging"
(2) vendor jq as you suggested. Looking at the Fedora specfile: https://src.fedoraproject.org/rpms/jq/blob/rawhide/f/jq.spec the only dependencies are flex, bison, chrpath and oniguruma-devel, afaics all are in RHEL
(3) extend our pmjson utility with an option to always convert the timestamp field to a human-readable date
In terms of resources/prioritization and simplicity, I'm leaning towards (1) or maybe (3), and I'd go with (2) only if we decide to use more functionality of jq in the future.
Yeah, (3) was always the option I was planning on taking originally. However, I've been unable to make any progress there for lack of time, and given our recent successes with vendoring, I'm thinking now it would be much better to use someone else's well-maintained code instead.
Related: #669 and #421