apm-server Various ambiguous APM server errors in the logs are not useful

We see in some situations (e.g. when APM server is all queued up), tens of thousands of ambiguous/brief error messages (e.g. data validation error, internal error, EOF) that are not useful for troubleshooting. Here are examples of various ambiguous errors in APM Server logs and suggested improvements:

data validation error

Can we include what part of the document structure is invalid? Or instruct the user to enable debugging to find out?
Can we include some "identification" information so users can quickly determine the origin of these rejected events?

internal error

Can we indicate what exactly is the error here?

unexpected EOF

Can we indicate what is wrong here or if it's a side effect of some other error?

failed to publish events: Post "<ES_endpoint>/_bulk": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Can we explain which "client" timeout it is and reference the documentation for information on how one can adjust this (or should they)? The APM agent's?

failed to publish events: Post "<ES_endpoint>/_bulk": EOF

Can we indicate what is wrong here or if it's a side effect of some other error?

404 page not found

What page are we referring to? What request are we trying to make?

sending request to kibana failed

What request are we trying to make? What is the impact if such request fails?

request timed out

What request are we trying to make (are we referring to indexing request to Elasticsearch)? What is the impact if such request fails?

request size exceeded

Can we include the size of the problem event and the current "size limit" in the message?
Can we include some "identification" information so users can quickly determine the origin of these rejected events?

It will be a better experience to provide details on what APM Server is complaining about :) For errors that can repeat tens of thousands of times, we might want to suppress some of these repeated errors that tend to drown the APM Server log file (e.g., if it's the same error, emit 1 error type every N minutes with the # of occurrences).

Oct 28 '21 19:10 ppf2

Thanks for opening this @ppf2 . In general the error.message is supposed to contain details on the error, the message itself is kept brief to be able to group by common issues.

data validation error

We should be able to include some more details on where the data validation error has happened (metadata, event, etc.) and what triggered it for most cases, potentially also the service.name.

internal error

Looking at the code, the available error information should already be logged. We need to dive deeper if additional context can be added.

unexpected EOF

not really, this is an unexpected end of file while decoding the input stream - when the input can'd even be decoded, there are no details available to add.

failed to publish events: Post "<ES_endpoint>/_bulk": context deadline exceeded (Client.Timeout exceeded while awaiting headers) failed to publish events: Post "<ES_endpoint>/_bulk": EOF

We are actively working on switching to a new ES indexer (https://github.com/elastic/apm-server/issues/6360), where these errors are going to be showing up in a different way.

404 page not found sending request to kibana failed request timed out

@ppf2 is there no other information logged that gives context outside of the error message? Could you share an example (ofc. redacting any sensitive information)?

request size exceeded

Can we include the size of the problem event and the current "size limit" in the message?
Can we include some "identification" information so users can quickly determine the origin of these rejected events?

The size of the problematic event is not available as we stop reading as soon as the buffer is full. We can look into adding some information about the service.name for which this happens. We don't want to add the current size limit to every one of these log lines, as this is a static configuration of the apm-server, that users can find in their server setup.

Nov 10 '21 10:11 simitt

It looks like sending request to kibana failed is related to agent-configuration:

sending request to kibana failed: Post "http://<host:port>/api/apm/settings/agent-configuration/search": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

For the other ones, I will send you the link to an internal dashboard :)

Nov 10 '21 21:11 ppf2

(This one shows up now that APM is part of Agent integration - these show up in the agent logs)

HTTP done

4K+ of these 200 status code messages (for both GET and POST) observed within a 2 hour period in a deployment at the INFO level

Feb 23 '22 17:02 ppf2

@ppf2 the HTTP done messages are not omitted by the apm-server, they are logged by elastic-agent or fleet-server. Afaik the team is aware and working on reducing the log message noise (@ph to confirm).

Feb 24 '22 08:02 simitt

We are aware of the noise, we have removed the HTTP Done in https://github.com/elastic/fleet-server/pull/1101, we are planning to review logging on Agent and Fleet. We are not happy with the level of the log, sometime we log too much, not enough. We need to make them more actionable.

Feb 24 '22 14:02 ph

Appending on to this APM Server (V8.3) errors are not guaranteed to be level:ERROR (vs level:DEBUG), e.g. context deadline exceeded:

{"log.level":"debug","@timestamp":"XXXXX","log.logger":"stacktrace","log.origin":{"file.name":"sourcemap/processor.go","file.line":123},"message":"failed to fetch source map: failure querying ES: context deadline exceeded","service.name":"apm-server","ecs.version":"1.6.0"}

Jul 14 '22 20:07 stefnestor