[aggregator] Add compression for data sent to aggregator
What this PR does / why we need it:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?:
Codecov Report
Merging #2082 (78867a8) into master (01958a4) will decrease coverage by 0.2%. The diff coverage is 53.6%.
@@            Coverage Diff             @@
##           master    #2082      +/-   ##
==========================================
- Coverage    72.3%    72.2%     -0.2%
==========================================
  Files        1015     1014       -1
  Lines       87961    87492     -469
==========================================
- Hits        63683    63252     -431
+ Misses      20029    20003      -26
+ Partials     4249     4237      -12
| Flag | Coverage Δ | |
|---|---|---|
| aggregator | 81.7% <42.8%> (-0.4%) | :arrow_down: |
| cluster | 85.6% <ø> (ø) | |
| collector | 64.8% <ø> (ø) | |
| dbnode | 79.7% <ø> (+<0.1%) | :arrow_up: |
| m3em | 73.2% <ø> (ø) | |
| m3ninx | 73.9% <ø> (ø) | |
| m3nsch | 51.1% <ø> (ø) | |
| metrics | 17.7% <ø> (ø) | |
| msg | 74.3% <65.2%> (-0.7%) | :arrow_down: |
| query | 68.2% <ø> (-0.5%) | :arrow_down: |
| x | 83.3% <ø> (ø) | |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 01958a4...78867a8.
Hi team 👋
I know that this PR is not yet merged, but we have started testing it and the results look quite promising.
After we deployed the Docker images built from this PR, we've seen ~25 write errors/s. From the m3aggregator logs:
{"level":"error","ts":1583968033.1887946,"msg":"decode error","remoteAddress":"10.2.10.207:34966","error":"snappy: corrupt input"}
I thought I'd provide some feedback here and also check whether you have any advice on how to track down the underlying issue. Some time ago we had a problem with some particular metrics that had a super long value for a label (enough to trigger the "id too large" error). I'm wondering if something similar might be affecting the compression?
This change is something we need, as we are also hitting network congestion limits in many of our m3coordinator pods.
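For context on where that message can come from: the golang/snappy decoder returns exactly "snappy: corrupt input" when it is handed a partial frame. A minimal, hypothetical sketch (standalone Go, not M3 code) that reproduces the message by truncating a compressed payload:

```go
package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	// Compress a payload, roughly standing in for a metrics frame the
	// coordinator would send to the aggregator.
	payload := []byte("some.metric.id|tag1=value1|tag2=value2|tag3=value3")
	compressed := snappy.Encode(nil, payload)

	// Simulate a send that was cut off partway through, e.g. because the
	// connection was closed before the full frame was flushed.
	truncated := compressed[:len(compressed)/2]

	// Decoding the partial frame fails with snappy.ErrCorrupt, which
	// prints as "snappy: corrupt input" -- the same message seen in the
	// m3aggregator logs above.
	if _, err := snappy.Decode(nil, truncated); err != nil {
		fmt.Println(err) // snappy: corrupt input
	}
}
```

That would line up with the timeout hypothesis in the reply below: a write that is aborted mid-frame leaves the aggregator with a truncated payload to decode.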
@jorgelbg I believe the errors you are facing occur because the connections from the coordinator are timing out and getting closed immediately, which in turn causes decoding issues on the m3aggregator side. To test this hypothesis, you can increase the connection write timeout in your m3coordinator config:
downsample:
  remoteAggregator:
    client:
      ...
      connection:
        writeTimeout: 15000ms # set this value to something higher and monitor in the e2e dashboard
I was having the same issue, and after I increased this timeout the errors almost disappeared. See the attached screenshot:

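To illustrate the mechanism (a hypothetical standalone sketch, not M3's actual client code): when a write deadline fires partway through a large write, some bytes have usually already reached the peer, so the receiving side ends up with a truncated compressed frame.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Local listener that accepts a connection but never reads from it,
	// standing in for a backed-up receiver.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		time.Sleep(5 * time.Second) // hold the connection open without reading
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A frame much larger than the socket buffers, written with a short
	// write deadline (analogous to a writeTimeout that is too low).
	frame := make([]byte, 64<<20) // 64 MiB
	if err := conn.SetWriteDeadline(time.Now().Add(100 * time.Millisecond)); err != nil {
		panic(err)
	}

	n, err := conn.Write(frame)
	// Typically prints a partial byte count plus an i/o timeout error. The
	// bytes that did get through form a truncated frame, which a snappy
	// decoder on the other side would report as corrupt input.
	fmt.Printf("wrote %d of %d bytes: %v\n", n, len(frame), err)
}
```

Raising writeTimeout gives large frames time to drain instead of being abandoned mid-write.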
Do you need help with this PR? I would be happy to resolve the conflicts and address the review feedback.