Propagate java-attacher errors to Kibana

Open axw opened this issue 3 years ago • 10 comments

When using the java-attacher, an error (e.g. failure to execute java) should be indicated in Kibana somehow. For example, this might be done by setting the status of the APM integration to degraded.

axw avatar Apr 11 '22 03:04 axw

@joshdover are there any plans to add a finer-grained health check UI to Fleet where this might fit? I believe @ruflin has mentioned some rough ideas in the past for a health state per agent, listing all the processes that are supposed to be running.

simitt avatar Apr 11 '22 06:04 simitt

I think that if a policy contains both APM Server and APM Agent configurations (probably only relevant to Java agent now, but hopefully will be relevant to others in the future), we can assume this APM Server is only used for local purposes and simply consider the entire APM integration unhealthy if there is an indication that the agent is unhealthy.

eyalkoren avatar Apr 11 '22 07:04 eyalkoren

@ph @jlind23 @cmacknz Can you chime in on the status of, and plans for, health reporting?

ruflin avatar Apr 11 '22 09:04 ruflin

After APM Server has discovered the Java installation and before it calls the attacher, it should also validate that the Java installation is working as expected.

Currently, APM Server logs this message when invoking the attacher fails: failed to run java attacher: exit status 1.

Checking whether the Java installation works by invoking java -version (and ideally writing the output to the server logs) helps show whether there's a general issue with the Java setup or whether something went wrong specifically with the attacher.

felixbarny avatar Apr 11 '22 14:04 felixbarny

Tested on Windows: I get the same result, or slightly worse, because it also can't download the requested agent version:

13:24:09.146 elastic_agent.apm_server [elastic_agent.apm_server][error] failed to run java attacher: exit status 1
13:24:09.785 elastic_agent.apm_server [elastic_agent.apm_server][error] Failed to download requested agent version 1.27.1, please double-check your --download-agent-version setting.
13:24:09.824 elastic_agent.apm_server [elastic_agent.apm_server][error] failed to run java attacher: exit status 1

jackshirazi avatar Apr 14 '22 12:04 jackshirazi

> @ph @jlind23 @cmacknz Can you chime in on the status and plans on health.

Improving the agent integration health reporting is tracked under https://github.com/elastic/elastic-agent/issues/100. We are just starting to design what this looks like.

cmacknz avatar Apr 14 '22 14:04 cmacknz

Regarding https://github.com/elastic/apm-server/issues/7832#issuecomment-1094631728, it is not yet clear to me whether an integration is also supposed to signal whether the Elastic Agent should try to restart the process when it is reported unhealthy, or whether there will be a finer-grained indication. A restart by the Elastic Agent would not make sense in the cases described. @cmacknz can you already share more details on what this will look like, or on expected timelines for defining the healthcheck work?

simitt avatar May 30 '22 12:05 simitt

@simitt We have been iterating on the design details. The proposal is Integration Status Health Reporting. It was being reworked a bit last week but the high level details are right. I added you to the stakeholder list to make sure you are notified of changes.

The new error reporting mechanism needs to be supported in the agent control protocol. @ph can comment on the timeline for implementing this, but I suspect implementation will start sometime in 8.4.

cmacknz avatar May 30 '22 13:05 cmacknz

@felixbarny given the above conversation, I don't think it makes sense to implement something in the apm-server before the healthcheck endpoint in the Elastic Agent is defined. What do you think?

simitt avatar Jun 03 '22 15:06 simitt

Yes, I agree. FYI @eyalkoren

felixbarny avatar Jun 03 '22 17:06 felixbarny

@eyalkoren is looking into splitting the attacher off into its own integration, which would naturally enable surfacing errors. I don't think it makes sense to invest in a lot of changes to Elastic Agent, Fleet, and APM Server in the interim, when we plan to provide a more dedicated integration in the hopefully not too distant future. If needed we can reopen this.

axw avatar Nov 15 '22 09:11 axw