
[aggregator] Add compression for data sent to aggregator

Open · robskillington opened this pull request 5 years ago • 4 comments

What this PR does / why we need it:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:


Does this PR require updating code package or user-facing documentation?:


robskillington avatar Dec 27 '19 01:12 robskillington

Codecov Report

Merging #2082 (78867a8) into master (01958a4) will decrease coverage by 0.2%. The diff coverage is 53.6%.


@@           Coverage Diff            @@
##           master   #2082     +/-   ##
========================================
- Coverage    72.3%   72.2%   -0.2%     
========================================
  Files        1015    1014      -1     
  Lines       87961   87492    -469     
========================================
- Hits        63683   63252    -431     
+ Misses      20029   20003     -26     
+ Partials     4249    4237     -12     
Flag         Coverage Δ
aggregator   81.7% <42.8%> (-0.4%) ↓
cluster      85.6% <ø> (ø)
collector    64.8% <ø> (ø)
dbnode       79.7% <ø> (+<0.1%) ↑
m3em         73.2% <ø> (ø)
m3ninx       73.9% <ø> (ø)
m3nsch       51.1% <ø> (ø)
metrics      17.7% <ø> (ø)
msg          74.3% <65.2%> (-0.7%) ↓
query        68.2% <ø> (-0.5%) ↓
x            83.3% <ø> (ø)

Flags with carried forward coverage won't be shown.



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 01958a4...78867a8.

codecov[bot] avatar Dec 27 '19 02:12 codecov[bot]

Hi team 👋

I know that this PR is not yet merged, but we have started testing it and the results look quite promising.

After we deployed the Docker images built from this PR, we've been seeing ~25 write errors/s. From the m3aggregator logs:

{"level":"error","ts":1583968033.1887946,"msg":"decode error","remoteAddress":"10.2.10.207:34966","error":"snappy: corrupt input"}

I thought I'd provide some feedback here and also check whether you have any advice on how to track down the underlying issue. Some time ago we had a problem with particular metrics that had a very long value for a label (enough to trigger the "id too large" error). Could something similar be affecting the compression?
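
For what it's worth, a minimal sketch (not M3 code; it uses the block format of github.com/golang/snappy purely for illustration, and the PR's actual framing may differ) shows that a payload cut off mid-stream fails to decode with exactly this error, so the symptom is also consistent with truncated data rather than oversized IDs:

package main

import (
    "fmt"

    "github.com/golang/snappy"
)

func main() {
    // Hypothetical payload standing in for an encoded metrics batch.
    payload := []byte(`{"metric":"http_requests_total","labels":{"path":"/very/long/value"}}`)
    compressed := snappy.Encode(nil, payload)

    // Simulate a payload that was cut off before all bytes arrived,
    // e.g. because the sending connection was closed mid-write.
    truncated := compressed[:len(compressed)/2]

    if _, err := snappy.Decode(nil, truncated); err != nil {
        fmt.Println("decode error:", err) // typically: snappy: corrupt input
    }
}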

jorgelbg avatar Mar 11 '20 23:03 jorgelbg

We need this change as well, as we are hitting network congestion limits in many of our m3coordinator pods.

@jorgelbg I believe the errors you are seeing occur because connections from the coordinator time out and are closed immediately, which in turn causes decoding issues on the m3aggregator side. To test this hypothesis, you can increase the connection write timeout in your m3coordinator config:


downsample:
  remoteAggregator:
    client:
      ...
      connection:
        writeTimeout: 15000ms   # set this value to something higher and monitor in the e2e dashboard

I was having the same issue, and after I increased this timeout the errors almost disappeared. (Screenshot attached: Screen Shot 2020-05-06 at 11:54:54 pm.)
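
For context, here is a minimal sketch (not M3 code) of the failure mode behind that setting, assuming writeTimeout maps to a per-write deadline on the underlying connection; the address and helper below are hypothetical:

package main

import (
    "fmt"
    "net"
    "time"
)

// writeWithTimeout mimics a client-side write timeout: if the deadline fires
// part-way through Write, the client gives up and closes the connection,
// leaving the receiving side with a truncated (undecodable) frame.
func writeWithTimeout(conn net.Conn, payload []byte, timeout time.Duration) error {
    if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
        return err
    }
    n, err := conn.Write(payload)
    if err != nil {
        // On a timeout, n can be non-zero: part of the frame was already
        // flushed to the socket before the deadline hit.
        return fmt.Errorf("wrote %d/%d bytes: %w", n, len(payload), err)
    }
    return nil
}

func main() {
    conn, err := net.Dial("tcp", "127.0.0.1:6000") // hypothetical aggregator address
    if err != nil {
        fmt.Println("dial:", err)
        return
    }
    defer conn.Close()

    if err := writeWithTimeout(conn, make([]byte, 1<<20), 15*time.Second); err != nil {
        fmt.Println("write:", err)
    }
}

A timed-out write can leave part of the compressed frame on the wire before the connection is closed, which is exactly the kind of truncated payload the snappy decoder rejects as corrupt input; raising writeTimeout gives congested connections more time to flush the full frame.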

pavelnikolov avatar May 06 '20 13:05 pavelnikolov

Do you need help with this PR? I would be happy to resolve the conflicts and address the feedback comments.

pavelnikolov avatar May 06 '20 14:05 pavelnikolov