
Act on server responses and implement back-off

Open gregkalapos opened this issue 5 years ago • 7 comments

Currently the agent tries to send each event once, and if it fails, it just logs.

We should implement retry logic, listen to the responses from the server, and act on them.

From: https://github.com/elastic/apm-agent-dotnet/pull/136#discussion_r264652202

gregkalapos avatar Mar 12 '19 20:03 gregkalapos

Related: https://github.com/elastic/apm-agent-java/pull/181 https://github.com/elastic/apm-agent-java/pull/210

I think we decided that there will be no retries when sending messages: if a send fails, the message is lost, and the back-off applies to the next message in the queue.

Please check with other agents to make sure we have a common approach.

alvarolobato avatar Mar 14 '19 12:03 alvarolobato

See https://github.com/elastic/apm/blob/master/docs/agent-development.md#transport-errors
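A minimal sketch of the approach discussed above (no retries, back-off before the next send), written in Python for brevity rather than in the agent's own language. The grace-period formula `min(reconnectCount++, 6) ** 2 ± 10%` seconds is how I recall the linked transport-errors section; treat it as an assumption and check the document. The names `Backoff` and `send_all` are hypothetical and are not part of the agent.

```python
import random


class Backoff:
    """Tracks consecutive send failures and yields the wait before the next send."""

    def __init__(self, max_exponent=6, jitter=0.10, rng=random):
        self.error_count = 0              # consecutive failures so far
        self.max_exponent = max_exponent  # caps the base delay at 6 ** 2 = 36 s
        self.jitter = jitter              # +/- fraction applied to the base delay
        self.rng = rng

    def on_success(self):
        # Any successful send resets the back-off.
        self.error_count = 0

    def on_failure(self):
        """Return the grace period in seconds, then count the failure."""
        base = min(self.error_count, self.max_exponent) ** 2
        self.error_count += 1
        return base * (1 + self.rng.uniform(-self.jitter, self.jitter))


def send_all(batches, send, sleep, backoff):
    """Try every batch exactly once; failed batches are dropped, never retried."""
    dropped = []
    for batch in batches:
        try:
            send(batch)
        except OSError:
            dropped.append(batch)
            sleep(backoff.on_failure())   # delays the NEXT batch only
        else:
            backoff.on_success()
    return dropped
```

With the jitter set to zero, eight consecutive failures would produce waits of 0, 1, 4, 9, 16, 25, 36 and 36 seconds. A real sender would pass `time.sleep` (or an async equivalent) as `sleep`; it is injected here only so the behaviour can be checked without waiting.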

SergeyKleyman avatar Aug 06 '19 11:08 SergeyKleyman

When the agent receives an unsuccessful HTTP status code from APM Server, we should log not just the body of the response but the HTTP status code itself as well. For example, in #433 we logged

{PayloadSenderV2} Failed sending event. queue is full

and logging the HTTP status code as well would have made it easier to find which of the conditions listed at https://www.elastic.co/guide/en/apm/server/7.3/common-problems.html triggered the response.
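As a rough illustration of this suggestion, the sketch below builds a log line that carries both the status code and the response body. It is in Python for brevity; the function name `describe_failure`, the message layout and the truncation limit are all hypothetical, not the agent's actual log format.

```python
def describe_failure(status_code, body, max_body_chars=200):
    """Build a log line that includes the HTTP status code and the response body."""
    snippet = body.strip()
    if len(snippet) > max_body_chars:
        # Keep log lines bounded when the server returns a long body.
        snippet = snippet[:max_body_chars] + "..."
    return (
        "{PayloadSenderV2} Failed sending event. "
        "HTTP status code: " + str(status_code) + ". "
        "Response body: " + snippet
    )
```

For instance, `describe_failure(503, "queue is full")` would produce a line ending in `HTTP status code: 503. Response body: queue is full`; the 503 here is only an example value.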

SergeyKleyman avatar Aug 06 '19 11:08 SergeyKleyman

https://github.com/elastic/apm/blob/master/docs/agent-development.md#batchingstreaming-data returns a 404. Is there any other update on this? FlushInterval is not working for me: it flushes at random intervals and really slowly.

wast avatar Feb 26 '21 16:02 wast

That part of the doc got moved here.

No update on implementing back-off. We haven't done that.

What exactly is your issue, @wast?

gregkalapos avatar Feb 26 '21 16:02 gregkalapos

Well, I got here while looking into FlushInterval and performance issues. I'm having a problem with infrequent flushes.

image

A single transaction of mine is quite big: it has 10 spans, and I'm currently running it on localhost. When I start a small load test of around 100 RPM, APM Server doesn't get data from my app.

Look at "newEventQueueCounts" and the timing of "{PayloadSenderV2} Sent items to server".

image

Current config:

"TransactionSampleRate": 0.2,
"StackTraceLimit": 10,
"FlushInterval": "5s", //doesn't work
"MaxQueueEventCount": 5000,
"SpanFramesMinDuration": "100ms", //this doesn't seem to work, all spans, including ones shorter than 100ms, are recorded
"CaptureHeaders": true,
"CentralConfig": false,
"CloudProvider": "none"

wast avatar Feb 26 '21 16:02 wast

Let me move this to a separate issue - it seems to be unrelated to this one. I opened https://github.com/elastic/apm-agent-dotnet/issues/1210 and I'll try to help there.

gregkalapos avatar Feb 26 '21 17:02 gregkalapos