containers-roadmap
[Fargate] [request]: Enhance the reliability of FireLens on Fargate
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Tell us about your request
FireLens has been demonstrated to be a fairly reliable log solution.
That being said, FireLens could go further on Fargate by becoming more managed and providing greater assurances of reliability. The Fargate platform has unique challenges because containers are ephemeral. Currently, the FireLens container is just another container in the Task, and when the Task stops it gets the standard 30-second SIGTERM-to-SIGKILL timeout. Furthermore, in the unlikely case that Fluentd/Bit goes down, all logs would be lost because Fargate containers are ephemeral.
Ideally, AWS could provide two features for FireLens on Fargate to improve reliability:
- Enable a file buffer for the Fluentd/Bit FireLens container, and restart the container if it goes down. Failures in the FireLens container would not stop a task, and logs would be preserved between stops and re-starts.
- Build a more robust mechanism than the built-in SIGTERM-to-SIGKILL timeout for the FireLens container. Ideally, after a task stops, the FireLens container would be given sufficient time to send all logs/data (up to a reasonable timeout measured in minutes). This might require changes to Fluent Bit. That way, when your task stops, all logs/data would be retrieved. (Note that providing a hard guarantee/promise around reliability is almost certainly impossible.)
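For context, here is a rough sketch of how close existing task definition settings get to this today. It is not the requested feature: it does not add a file buffer or restart the log router. It only marks the FireLens container as non-essential and raises its `stopTimeout` to 120 seconds, which is the documented maximum on Fargate. The container names, the application image, the log group, and the region are placeholders assumed for illustration.

```python
import json

# Documented upper bound for stopTimeout on Fargate, in seconds.
FARGATE_MAX_STOP_TIMEOUT = 120


def firelens_container_definitions(app_image, stop_timeout_seconds=FARGATE_MAX_STOP_TIMEOUT):
    """Build ECS containerDefinitions for an app plus a FireLens sidecar.

    Hypothetical helper for illustration; names and values are placeholders.
    """
    if not 0 < stop_timeout_seconds <= FARGATE_MAX_STOP_TIMEOUT:
        raise ValueError("stopTimeout on Fargate must be between 1 and 120 seconds")

    log_router = {
        "name": "log_router",  # placeholder name
        "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
        # Non-essential: a crash of the sidecar does not stop the task.
        "essential": False,
        # Seconds between SIGTERM and SIGKILL when the task stops.
        "stopTimeout": stop_timeout_seconds,
        "firelensConfiguration": {"type": "fluentbit"},
    }
    app = {
        "name": "app",  # placeholder name
        "image": app_image,
        "essential": True,
        # Start the log router before the application container.
        "dependsOn": [{"containerName": "log_router", "condition": "START"}],
        "logConfiguration": {
            "logDriver": "awsfirelens",
            "options": {
                "Name": "cloudwatch_logs",
                "region": "us-east-1",             # placeholder
                "log_group_name": "/example/app",  # placeholder
                "log_stream_prefix": "app-",       # placeholder
            },
        },
    }
    return [log_router, app]


if __name__ == "__main__":
    print(json.dumps(firelens_container_definitions("example/app:latest"), indent=2))
```

Even with these settings, the grace period is capped at two minutes and any logs buffered in the sidecar are lost if it crashes, which is the gap the two features above are meant to close.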
Kudos to this proposal. Having a tool like this for Fargate would definitely improve the sales pitch of the managed container service. The ability to almost guarantee logs while not crashing the main task in the unlikely eventuality of the sidecar crashing is truly a remarkable reliability solution.