Perf regression, too many small bulk requests
APM Server version (apm-server version):
8.13.2
Description of the problem including expected versus actual behavior:
I believe there is a regression in the number of bulk requests/s emitted by APM Server towards Elasticsearch. I don't have complete data for it, but I believe the issue started in 8.12.x and then got worse with the upgrade to 8.13.x.
This is from the Elastic Cloud console: on otherwise identical clients/agents, we see that with the upgrade from 8.12 to 8.13 the number of indexing requests/s jumped noticeably.
The number of documents/s that get ingested didn't really change much, but our hot nodes started to run under higher load. Fortunately, we observed that setting index.translog.durability=async on all APM indices drops IOPS per hot node back to acceptable levels.
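A minimal sketch of applying that setting with the go-elasticsearch client is shown below; the wildcard index patterns are assumptions, and the setting only sticks to the existing backing indices unless it is also added to the relevant index templates so it survives rollovers.

package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	// Assumes the usual ELASTICSEARCH_URL / credential environment variables.
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Relax translog durability on the existing APM backing indices.
	// The wildcard patterns below are an assumption; adjust to the actual data streams.
	res, err := es.Indices.PutSettings(
		strings.NewReader(`{"index": {"translog": {"durability": "async"}}}`),
		es.Indices.PutSettings.WithIndex("metrics-apm*", "traces-apm*", "logs-apm*"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}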
With a decently tuned bulk_max_size setting I would not expect such a big drop in IOPS. I am thus wondering if the APM Server is misconfigured in newer versions.
It almost feels like the server uses the agent defaults instead of its custom bulk size option. Was setting preset: custom in the server config perhaps forgotten once https://github.com/elastic/elastic-agent/issues/3797 got merged into the agent?
@StephanErb thanks for raising this. I've had a look at the code changes, and nothing obviously stands out. There were some changes in the ES output code, but they're not enabled in APM Server 8.13.x. We'll need to try and reproduce the issue and work back from there.
The number of documents/s that get ingested didn't really change much
Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?
It almost feels like the server uses the agent defaults instead of its custom bulk size option? Was it maybe forgotten to set preset: custom into the server config once https://github.com/elastic/elastic-agent/issues/3797 got merged into the agent?
APM Server has its own ES output implementation, completely independent of Elastic Agent/Beats.
The number of documents/s that get ingested didn't really change much
Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?
We have upgraded the APM Server and Elasticsearch, but the APM clients remained unchanged. There was thus no change in the data forwarded to APM Server and ultimately indexed into Elasticsearch.
Some stats on ingestion from stack monitoring:
- Ingestion of 20000 docs/s into .ds-metrics-apm.app.all-default-2024.04.23-004117
- Ingestion of 4000 docs/s into .ds-metrics-apm.internal-default-2024.04.21-000931
- Ingestion of 2000 docs/s into .ds-traces-apm-default-2024.04.23-001742
- Ingestion of 1000 docs/s into .ds-logs-apm.app.all-default-2024.04.17-000162
 
Please note that we are running a reroute pipeline to avoid per-service indices; instead, all services flow into one shared "all" data stream to prevent a shard explosion.
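For context, such a reroute can be expressed with an ingest pipeline along the lines of the sketch below. The pipeline id, the dataset/namespace values, and how the pipeline gets attached to the APM data streams (e.g. via an @custom pipeline) are illustrative assumptions, not the exact setup described here.

package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical pipeline id; in practice the reroute processor has to be
	// attached to the APM data streams, e.g. through one of the *@custom pipelines.
	res, err := es.Ingest.PutPipeline("apm-reroute-all-example", strings.NewReader(`{
	  "processors": [
	    { "reroute": { "dataset": "apm.app.all", "namespace": "default" } }
	  ]
	}`))
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}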
When checking GET _tasks?pretty&human&detailed=true&actions=indices:data/write/bulk we see a few hundred parallel tasks such as:
"iaCFp3aFSnuliL3GpI_fnQ:1325096688": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325096688,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[211], indices[metrics-apm <SNIP not to leak service names> traces-apm-default]",
      "start_time": "2024-04-23T20:26:19.872Z",
      "start_time_in_millis": 1713903979872,
      "running_time": "299.7ms",
      "running_time_in_nanos": 299780808,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
},
"iaCFp3aFSnuliL3GpI_fnQ:1325097388": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325097388,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[261], indices[metrics-apm.internal-default, metrics-apm.app <SNIP not to leak service names>]",
      "start_time": "2024-04-23T20:26:20.165Z",
      "start_time_in_millis": 1713903980165,
      "running_time": "61.3ms",
      "running_time_in_nanos": 61304826,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
    }
}
As per requests[211] and requests[261], my interpretation is that we are seeing a batch size of ~200 documents, which appears relatively low to me given the overall throughput of the system. (I hope this is correct; I could not find any docs on the format.)
This would again be in line with why there was such a big drop in IOPS once we moved over to index.translog.durability=async.
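As a rough cross-check of these numbers: the data streams listed above add up to roughly 27,000 docs/s, so at ~200 documents per bulk request that would be on the order of 135 bulk requests/s. The sketch below (using the go-elasticsearch client, and assuming the requests[N] description format shown above) tallies the in-flight bulk tasks and their average size:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"regexp"
	"strconv"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// List the in-flight bulk tasks, same as the GET _tasks call above.
	res, err := es.Tasks.List(
		es.Tasks.List.WithDetailed(true),
		es.Tasks.List.WithActions("indices:data/write/bulk"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var out struct {
		Nodes map[string]struct {
			Tasks map[string]struct {
				Description string `json:"description"`
			} `json:"tasks"`
		} `json:"nodes"`
	}
	if err := json.NewDecoder(res.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}

	// Pull the requests[N] count out of each task description.
	re := regexp.MustCompile(`requests\[(\d+)\]`)
	var tasks, docs int
	for _, node := range out.Nodes {
		for _, task := range node.Tasks {
			if m := re.FindStringSubmatch(task.Description); m != nil {
				n, _ := strconv.Atoi(m[1])
				tasks++
				docs += n
			}
		}
	}
	if tasks > 0 {
		fmt.Printf("in-flight bulk tasks: %d, average docs per bulk: %d\n", tasks, docs/tasks)
	}
}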
Hi :wave:
go-docappender (that's where the indexing logic is implemented) was bumped between those versions. The diff of the changes is at https://github.com/elastic/go-docappender/compare/c69a1cf8ce35...v1.0.0
I'd like to confirm a few things before going deeper:
- Can you confirm no config option was changed between versions? Specifically flush interval, compression level, flush bytes, max requests and retry backoff settings.
- Was there any change to the max memory assigned to APM Server? (memory limits affect the max buffer size)
- How is the number of requests measured? (there were some fixes to some specific metrics)
 
As mentioned above, the APM Server has a custom ES output implementation, so it's not using Beats presets. One thing that could affect the number of requests is the status codes in the retry config: we retry on 429, 502, 503, 504. Could you check if there is an increase in the number of failed requests or in any specific status code?
Thanks for picking this up.
We just upgraded via the Cloud console without any additional configuration changes for the APM Server (flush interval, etc.) or changes to APM Server memory. We are running two APM Servers in the max configuration (2x30 GB).
This is what the load looks like as per Stack Monitoring (incl. request stats):
Unfortunately I don't have enough history retention on the monitoring cluster to look at what happened around the time we upgraded.
In our logs we see a relatively large number of indexing errors:
10:23:19.114 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.115 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.116 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.117 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:156] failed to parse field [hikaricp.connections] of type [double] in document with id 'yIhhKY8BTYD0o0NlX6M_'
Could those errors lead to higher ingest load now that https://github.com/elastic/go-docappender/pull/99 has landed?
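As an aside on the document_parsing_exception above: this is the usual object-vs-scalar mapping conflict. Once a field such as kafka.consumer.fetch.manager.records.lag.max has been indexed (making lag an object), any document that sends lag as a plain number is rejected. The sketch below reproduces the same error against a throwaway index (the index name is made up):

package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// First document maps "lag" as an object (because of the nested "max" field).
	res, err := es.Index("mapping-conflict-demo",
		strings.NewReader(`{"kafka":{"consumer":{"fetch":{"manager":{"records":{"lag":{"max":3}}}}}}}`))
	if err != nil {
		log.Fatal(err)
	}
	res.Body.Close()

	// Second document sends "lag" as a concrete value and fails with
	// document_parsing_exception, just like the APM Server log lines above.
	res, err = es.Index("mapping-conflict-demo",
		strings.NewReader(`{"kafka":{"consumer":{"fetch":{"manager":{"records":{"lag":7}}}}}}`))
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status(), res.String())
}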
Thank you for the follow-up!
Are you using any sampling config? Tail-based sampling or head-based sampling?
Could those errors lead to higher ingest load now that https://github.com/elastic/go-docappender/pull/99 has landed?
That specific PR shouldn't impact the number of requests, because per-document retries are not enabled in APM Server. It's possible for the ingest load to increase if APM Server receives specific status codes and the request retry logic is executed.
There are two kinds of retry logic:
- document retries: this one retries single documents in a bulk request if they return specific status codes. This is not enabled in APM Server and cannot be enabled by config.
- request retries: this one retries the whole HTTP request (bulk index) if ES responds with specific status codes (429, 502, 503, 504). A simplified sketch of the difference follows below.
 
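A simplified sketch of the request-level behaviour, not the actual go-docappender code (the endpoint, backoff values, and helper name are made up): the whole _bulk request is resent while Elasticsearch answers with one of the retryable status codes, whereas per-document retries would instead re-enqueue only the failed items from the bulk response.

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Status codes that trigger a retry of the whole bulk request.
var retryableStatus = map[int]bool{429: true, 502: true, 503: true, 504: true}

// sendBulk resends the complete _bulk request with exponential backoff while
// Elasticsearch answers with a retryable status code. (Per-document retries,
// which APM Server does not enable, would instead parse the bulk response and
// re-enqueue only the failed items whose per-item status is retryable.)
func sendBulk(url string, body []byte, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := http.Post(url, "application/x-ndjson", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		if !retryableStatus[resp.StatusCode] || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()
		time.Sleep(backoff)
		backoff *= 2
	}
}

func main() {
	payload := []byte(`{"create":{"_index":"retry-demo"}}` + "\n" + `{"message":"hello"}` + "\n")
	resp, err := sendBulk("http://localhost:9200/_bulk", payload, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("bulk response status:", resp.Status)
}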
@StephanErb it would be great if you could provide the information on whether you have sampling enabled, and if so, whether it is tail-based or head-based sampling. A bugfix was rolled out in 8.13 that might explain the behaviour, but to confirm we would need more information about the sampling.
@simitt this is with:
- head-based sampling
- sample rate of about 10%
 
Hello @StephanErb, apologies for the delay here. I wanted to ask for one clarification which I didn't see mentioned in the thread: did you also upgrade Elasticsearch to the same 8.13.x version as apm-server?
Yes, both got updated to the same version via a regular update initiated on Elastic Cloud.
I'm investigating this issue and I think I've found the bug. I'll create a PR urgently to fix this.