Propagate java-attacher errors to Kibana
When using the java-attacher, an error (e.g. failure to execute java) should be indicated in Kibana somehow. For example, this might be done by setting the status of the APM integration to degraded.
@joshdover are there any plans for adding a more fine-grained health check UI to Fleet where this might fit? I believe in the past @ruflin mentioned some vague ideas for a health state per agent, listing all the processes that are supposed to be running.
I think that if a policy contains both APM Server and APM Agent configurations (probably only relevant to Java agent now, but hopefully will be relevant to others in the future), we can assume this APM Server is only used for local purposes and simply consider the entire APM integration unhealthy if there is an indication that the agent is unhealthy.
@ph @jlind23 @cmacknz Can you chime in on the status and plans for health reporting?
After APM Server has discovered the Java installation and before it calls the attacher, it should also validate that the Java installation is working as expected.
Currently, APM Server logs this message when invoking the attacher fails: failed to run java attacher: exit status 1.
Checking whether the Java installation is working by invoking java -version (and ideally logging the output to the server logs) helps to show whether there is a general issue with the Java setup or whether something went wrong specifically with the attacher.
Tested on Windows, I get the same result, or slightly worse, since it also fails to download the requested agent version:
13:24:09.146 elastic_agent.apm_server [elastic_agent.apm_server][error] failed to run java attacher: exit status 1
13:24:09.785 elastic_agent.apm_server [elastic_agent.apm_server][error] Failed to download requested agent version 1.27.1, please double-check your --download-agent-version setting.
13:24:09.824 elastic_agent.apm_server [elastic_agent.apm_server][error] failed to run java attacher: exit status 1
@ph @jlind23 @cmacknz Can you chime in on the status and plans on health.
Improving the agent integration health reporting is tracked under https://github.com/elastic/elastic-agent/issues/100. We are just starting to design what this looks like.
Regarding https://github.com/elastic/apm-server/issues/7832#issuecomment-1094631728, it is not yet clear to me whether an integration is also supposed to signal whether the Elastic Agent should try to restart the process when it is reported unhealthy, or whether there will be a finer-grained indication. A restart by the Elastic Agent would not make sense in the described cases. @cmacknz can you already share any more details on what this will look like, or on expected timelines for defining the healthcheck work?
@simitt We have been iterating on the design details. The proposal is Integration Status Health Reporting. It was being reworked a bit last week, but the high-level details are right. I added you to the stakeholder list to make sure you are notified of changes.
The new error reporting mechanism needs to be supported in the agent control protocol. @ph can comment on the timeline for implementing this, but I suspect implementation will start sometime in 8.4.
@felixbarny given the above conversation, I don't think it makes sense to implement something in the apm-server before the healthcheck endpoint in the Elastic Agent is defined. What do you think?
Yes, I agree. FYI @eyalkoren
@eyalkoren is looking into splitting the attacher off into its own integration, which would naturally enable surfacing errors. I don't think it makes sense to invest in a lot of changes to Elastic Agent, Fleet, and APM Server in the interim when we plan to provide a more dedicated integration in the hopefully not-too-distant future. If needed, we can reopen this.