amazon-ecs-agent

Add retries for publishing metrics & health checks

Open strategicpause opened this issue 1 year ago • 0 comments

Summary

This is a request to add retries in the case of the agent failing to publish metrics or health check messages to TACS.

Description

In my logs I see cases where the ECS agent emits the message "Error publishing metrics". From reading the code, it looks like tcsClientServer.publishMessages reads metrics and health metrics from a channel and logs an error if they could not be published. As a result, metrics or health checks go unreported to TACS whenever there is an error sending a message to TACS. For example, this could occur when a WS connection is closed by the server, which results in the client initiating a new connection.

Expected Behavior

I would expect some kind of retry mechanism that attempts to resend the metrics or health checks over the connection. I don't see any retry logic further down the stack either, e.g. in ClientServerImpl.MakeRequest.

Observed Behavior

The following log line:

05:20:14.273 | {"level":"warn","time":"2024-03-02T05:20:14.032","msg":"Error publishing metrics","error":"websocket: close sent"}

Environment Details

Running on AL2 with kernel 5.10

Supporting Log Snippets

See above.

strategicpause, Mar 05 '24 17:03