Perf regression, too many small bulk requests
APM Server version (apm-server version):
8.13.2
Description of the problem including expected versus actual behavior:
I believe there is a regression in the number of bulk requests/s emitted by APM Server towards Elasticsearch. I don't have complete data for it, but I believe the issue started in 8.12.x and then got worse with the upgrade to 8.13.x.
This is from the Elastic Cloud console: on otherwise identical clients/agents, we see that with the upgrade from 8.12 to 8.13 the number of indexing requests/s jumped noticeably.
The number of documents/s that get ingested didn't really change much, but our hot nodes started to run under higher load. Fortunately, we observed that setting index.translog.durability=async on all APM indices drops IOPS per hot node back to acceptable levels.
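A minimal sketch of applying that setting with the go-elasticsearch client is shown below; the wildcard index patterns are assumptions, and the setting only sticks to the existing backing indices unless it is also added to the relevant index templates so it survives rollovers.

package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	// Assumes the usual ELASTICSEARCH_URL / credential environment variables.
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Relax translog durability on the existing APM backing indices.
	// The wildcard patterns below are an assumption; adjust to the actual data streams.
	res, err := es.Indices.PutSettings(
		strings.NewReader(`{"index": {"translog": {"durability": "async"}}}`),
		es.Indices.PutSettings.WithIndex("metrics-apm*", "traces-apm*", "logs-apm*"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}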
With a decently tuned bulk_max_size setting I would not expect such a big drop in IOPS. I am thus wondering if the APM Server is misconfigured in newer versions.
It almost feels like the server uses the agent defaults instead of its custom bulk size option. Was setting preset: custom in the server config perhaps forgotten once https://github.com/elastic/elastic-agent/issues/3797 got merged into the agent?
@StephanErb thanks for raising this. I've had a look at the code changes, and nothing obviously stands out. There were some changes in the ES output code, but they're not enabled in APM Server 8.13.x. We'll need to try and reproduce the issue and work back from there.
The number of documents/s that get ingested didn't really change much
Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?
It almost feels like the server uses the agent defaults instead of its custom bulk size option? Was it maybe forgotten to set preset: custom into the server config once https://github.com/elastic/elastic-agent/issues/3797 got merged into the agent?
APM Server has its own ES output implementation, completely independent of Elastic Agent/Beats.
The number of documents/s that get ingested didn't really change much
Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?
We have upgraded the APM Server and Elasticsearch, but the APM clients remained unchanged. There was thus no change in the data forwarded to APM Server and ultimately indexed into Elasticsearch.
Some stats on ingestion from stack monitoring:
- Ingestion of 20000 docs/s into .ds-metrics-apm.app.all-default-2024.04.23-004117
- Ingestion of 4000 docs/s into .ds-metrics-apm.internal-default-2024.04.21-000931
- Ingestion of 2000 docs/s into .ds-traces-apm-default-2024.04.23-001742
- Ingestion of 1000 docs/s into .ds-logs-apm.app.all-default-2024.04.17-000162
 
Please note that we are running a reroute pipeline to avoid per-service indices; instead, all services flow into one shared "all" data stream to prevent a shard explosion.
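For context, such a reroute can be expressed with an ingest pipeline along the lines of the sketch below. The pipeline id, the dataset/namespace values, and how the pipeline gets attached to the APM data streams (e.g. via an @custom pipeline) are illustrative assumptions, not the exact setup described here.

package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical pipeline id; in practice the reroute processor has to be
	// attached to the APM data streams, e.g. through one of the *@custom pipelines.
	res, err := es.Ingest.PutPipeline("apm-reroute-all-example", strings.NewReader(`{
	  "processors": [
	    { "reroute": { "dataset": "apm.app.all", "namespace": "default" } }
	  ]
	}`))
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}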
When checking GET _tasks?pretty&human&detailed=true&actions=indices:data/write/bulk we see a few hundred parallel tasks such as:
"iaCFp3aFSnuliL3GpI_fnQ:1325096688": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325096688,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[211], indices[metrics-apm <SNIP not to leak service names> traces-apm-default]",
      "start_time": "2024-04-23T20:26:19.872Z",
      "start_time_in_millis": 1713903979872,
      "running_time": "299.7ms",
      "running_time_in_nanos": 299780808,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
},
"iaCFp3aFSnuliL3GpI_fnQ:1325097388": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325097388,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[261], indices[metrics-apm.internal-default, metrics-apm.app <SNIP not to leak service names>]",
      "start_time": "2024-04-23T20:26:20.165Z",
      "start_time_in_millis": 1713903980165,
      "running_time": "61.3ms",
      "running_time_in_nanos": 61304826,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
    }
}
As per requests[211] and requests[261], my interpretation is that we are seeing a batch size of ~200 documents, which appears relatively low to me given the overall throughput of the system. (I hope this is correct; I could not find any docs on the format.)
This would again be in line with why there was such a big drop in IOPS once we moved over to index.translog.durability=async.
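As a rough cross-check of these numbers: the data streams listed above add up to roughly 27,000 docs/s, so at ~200 documents per bulk request that would be on the order of 135 bulk requests/s. The sketch below (using the go-elasticsearch client, and assuming the requests[N] description format shown above) tallies the in-flight bulk tasks and their average size:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"regexp"
	"strconv"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// List the in-flight bulk tasks, same as the GET _tasks call above.
	res, err := es.Tasks.List(
		es.Tasks.List.WithDetailed(true),
		es.Tasks.List.WithActions("indices:data/write/bulk"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var out struct {
		Nodes map[string]struct {
			Tasks map[string]struct {
				Description string `json:"description"`
			} `json:"tasks"`
		} `json:"nodes"`
	}
	if err := json.NewDecoder(res.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}

	// Pull the requests[N] count out of each task description.
	re := regexp.MustCompile(`requests\[(\d+)\]`)
	var tasks, docs int
	for _, node := range out.Nodes {
		for _, task := range node.Tasks {
			if m := re.FindStringSubmatch(task.Description); m != nil {
				n, _ := strconv.Atoi(m[1])
				tasks++
				docs += n
			}
		}
	}
	if tasks > 0 {
		fmt.Printf("in-flight bulk tasks: %d, average docs per bulk: %d\n", tasks, docs/tasks)
	}
}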
Hi :wave:
go-docappender (that's where the indexing logic is implemented) was bumped between those versions. The diff of the changes is at https://github.com/elastic/go-docappender/compare/c69a1cf8ce35...v1.0.0
I'd like to confirm a few things before going deeper:
- Can you confirm no config option was changed between versions? Specifically flush interval, compression level, flush bytes, max requests and retry backoff settings.
- Was there any change to the max memory assigned to APM Server? (memory limits affect the max buffer size)
- How is the number of requests measured? (there were some fixes to some specific metrics)
 
As mentioned above, the APM Server has a custom ES output implementation, so it's not using Beats presets. One thing that could affect the number of requests is the status codes in the retry config: we retry on 429, 502, 503, 504. Could you check if there is an increase in the number of failed requests or in any specific status code?
Thanks for picking this up.
We just upgraded via the Cloud console without any additional configuration changes for the APM Server (flush interval, etc.) or changes to APM Server memory. We are running two APM Servers in the max configuration (2x30 GB).
This is what the load looks like as per Stack Monitoring (incl. request stats):
Unfortunately I don't have enough history retention on the monitoring cluster to look at what happened around the time we upgraded.
In our logs we see a relatively large number of indexing errors:
10:23:19.114 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.115 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.116 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.117 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:156] failed to parse field [hikaricp.connections] of type [double] in document with id 'yIhhKY8BTYD0o0NlX6M_'
Could those errors lead to higher ingest load now that https://github.com/elastic/go-docappender/pull/99 has landed?
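As an aside on the document_parsing_exception above: this is the usual object-vs-scalar mapping conflict. Once a field such as kafka.consumer.fetch.manager.records.lag.max has been indexed (making lag an object), any document that sends lag as a plain number is rejected. The sketch below reproduces the same error against a throwaway index (the index name is made up):

package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// First document maps "lag" as an object (because of the nested "max" field).
	res, err := es.Index("mapping-conflict-demo",
		strings.NewReader(`{"kafka":{"consumer":{"fetch":{"manager":{"records":{"lag":{"max":3}}}}}}}`))
	if err != nil {
		log.Fatal(err)
	}
	res.Body.Close()

	// Second document sends "lag" as a concrete value and fails with
	// document_parsing_exception, just like the APM Server log lines above.
	res, err = es.Index("mapping-conflict-demo",
		strings.NewReader(`{"kafka":{"consumer":{"fetch":{"manager":{"records":{"lag":7}}}}}}`))
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status(), res.String())
}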
Thank you for the follow-up!
Are you using any sampling config? Tail-based sampling or head-based sampling?
Could those errors lead to higher ingest load now that https://github.com/elastic/go-docappender/pull/99 has landed?
That specific PR shouldn't impact the number of requests, because per-document retries are not enabled in APM Server. It's possible for the ingest load to increase if APM Server receives specific status codes and the request retry logic is executed.
There are two kinds of retry logic:
- document retries: this one retries single documents in a bulk request if they return specific status codes. This is not enabled in APM Server and cannot be enabled by config.
- request retries: this one retries the whole HTTP request (bulk index) if ES responds with specific status codes (429, 502, 503, 504). A simplified sketch of the difference follows below.
 
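A simplified sketch of the request-level behaviour, not the actual go-docappender code (the endpoint, backoff values, and helper name are made up): the whole _bulk request is resent while Elasticsearch answers with one of the retryable status codes, whereas per-document retries would instead re-enqueue only the failed items from the bulk response.

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Status codes that trigger a retry of the whole bulk request.
var retryableStatus = map[int]bool{429: true, 502: true, 503: true, 504: true}

// sendBulk resends the complete _bulk request with exponential backoff while
// Elasticsearch answers with a retryable status code. (Per-document retries,
// which APM Server does not enable, would instead parse the bulk response and
// re-enqueue only the failed items whose per-item status is retryable.)
func sendBulk(url string, body []byte, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := http.Post(url, "application/x-ndjson", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		if !retryableStatus[resp.StatusCode] || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()
		time.Sleep(backoff)
		backoff *= 2
	}
}

func main() {
	payload := []byte(`{"create":{"_index":"retry-demo"}}` + "\n" + `{"message":"hello"}` + "\n")
	resp, err := sendBulk("http://localhost:9200/_bulk", payload, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("bulk response status:", resp.Status)
}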
@StephanErb it would be great if you could provide the information on whether you have sampling enabled, and if so, whether it is tail-based or head-based sampling. A bugfix was rolled out in 8.13 that might explain the behaviour, but to confirm we would need more information about the sampling.
@simitt this is with:
- head-based sampling
- sample rate of about 10%
 
Hello @StephanErb, apologies for the delay here. I wanted to ask for one clarification which I didn't see mentioned in the thread: did you also upgrade Elasticsearch to the same 8.13.x version as apm-server?
Yes, both got updated to the same version via a regular update initiated on Elastic Cloud.
I'm investigating this issue and I think I've found the bug. I'll create a PR urgently to fix this.