Add retries for publishing metrics & health checks
Summary
This is a request to add retries for cases where the agent fails to publish metrics or health check messages to TACS.
Description
I noticed cases in my logs where the ECS agent emits the message "Error publishing metrics". From looking at the code, it looks like tcsClientServer.publishMessages reads metrics and health metrics from a channel and logs an error if a message could not be published. Because the message is not resent, the metrics or health checks in that message are never reported to TACS whenever a send fails. For example, this could occur when the server closes the WebSocket connection, which results in the client initiating a new connection.
Expected Behavior
I would expect some kind of retry mechanism that attempts to resend the metrics or health checks over the connection. I don't see any retry logic further down the stack either, e.g. in ClientServerImpl.MakeRequest.
Observed Behavior
The following log line:
05:20:14.273 | {"level":"warn","time":"2024-03-02T05:20:14.032","msg":"Error publishing metrics","error":"websocket: close sent"}
Environment Details
Running on AL2 with kernel 5.10
Supporting Log Snippets
See above.