agones
Metrics data loss in K8S controller
What happened: After restarting the K8S controller, the "agones_gameservers_total" metric is no longer being collected for the "shipping-mode1-map1-3568" battle server. However, the "agones_gameservers_count" metric is still being collected.
What you expected to happen: I expected both the "agones_gameservers_total" and "agones_gameservers_count" metrics to continue being collected consistently, even after the controller restart.
- max by(type) (agones_gameservers_total{fleet_name="shipping-mode1-map1-3568"})
- avg by(type) (agones_gameservers_count{fleet_name="shipping-mode1-map1-3568"})
How to reproduce it (as minimally and precisely as possible):
- Start the K8S controller - Agones Controller.
- Create "shipping-mode1-map1-3568" fleet
- Check the metrics for the "shipping-mode1-map1-3568" battle server.
- Observe that the "agones_gameservers_total" metric is being collected, but "agones_gameservers_count" metric is not.
- Restart the controller.
- Check the metrics for the "shipping-mode1-map1-3568" battle server after the restart.
- Notice that the "agones_gameservers_total" metric is no longer being collected, while the "agones_gameservers_count" metric is still being collected.
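A minimal sketch of how the "check the metrics" steps above could be scripted, assuming you have already saved the controller's Prometheus text output to a string. The helper name `series_for_fleet` and the sample scrape text are hypothetical and made up for illustration; only the metric names, the `fleet_name` and `type` labels, and the fleet name come from this report.

```python
import re

# A sample line looks like: name{label="value",...} 12
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def series_for_fleet(exposition_text, metric, fleet_name):
    """Return (labels, value) pairs of `metric` whose fleet_name label matches."""
    matches = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        m = _SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        if labels.get("fleet_name") == fleet_name:
            matches.append((labels, float(m.group("value"))))
    return matches


# Hypothetical scrape output, invented for this example (not real controller output).
SAMPLE = '''\
# TYPE agones_gameservers_count gauge
agones_gameservers_count{fleet_name="shipping-mode1-map1-3568",type="Ready"} 3
agones_gameservers_count{fleet_name="other-fleet",type="Ready"} 1
# TYPE agones_gameservers_total counter
agones_gameservers_total{fleet_name="other-fleet",type="Ready"} 7
'''

for name in ("agones_gameservers_total", "agones_gameservers_count"):
    found = series_for_fleet(SAMPLE, name, "shipping-mode1-map1-3568")
    print(name, "present" if found else "MISSING")
```

Running the same check before and after the controller restart would show which of the two metrics disappears for the fleet without going through a dashboard.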
Anything else we need to know?:
- In the cluster, there are a total of 10 fleets, and there is a continuous process of deleting existing fleets and creating new fleets. This dynamic fleet activity might have an impact on the metrics data collection.
Environment:
- Agones version: 1.35.0
- Kubernetes version (use kubectl version): Client Version: v1.27.2, Kustomize Version: v5.0.1, Server Version: v1.22.5-tke.19
- Cloud provider or hardware configuration:
- Install method (yaml/helm): helm
- Troubleshooting guide log(s):
- Others:
This sounds like it works as intended.
- If a fleet is deleted and we restart the controller, we can't create the old metrics - it's all in memory.
- If a fleet is deleted we specifically remove it from all metrics reporting to ensure a memory leak / metric explosion doesn't happen (we have to do a full reset to do it).
See https://github.com/googleforgames/agones/issues/2478 for context.
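To make the two points above concrete, here is a toy model in Python. It is not the Agones implementation; the class `ToyMetrics` and its methods are hypothetical. It assumes that a counter-style metric only knows about events the current process has seen, while a gauge-style metric can be recomputed from whatever currently exists in the cluster.

```python
class ToyMetrics:
    """Toy in-memory metric store; NOT the Agones implementation."""

    def __init__(self):
        self.total = {}  # counter: (fleet, state) -> events seen by this process
        self.count = {}  # gauge: (fleet, state) -> current number of game servers

    def record_event(self, fleet, state):
        key = (fleet, state)
        self.total[key] = self.total.get(key, 0) + 1

    def refresh_count(self, cluster_state):
        # Assumption: gauges can be rebuilt from what exists in the cluster right now.
        self.count = {}
        for fleet, state in cluster_state:
            key = (fleet, state)
            self.count[key] = self.count.get(key, 0) + 1

    def reset_on_fleet_delete(self, cluster_state):
        # Full reset: every series is dropped, then only what can be derived
        # from the current state is rebuilt. Counter history is gone.
        self.total = {}
        self.refresh_count(cluster_state)


cluster = [("fleet-a", "Ready"), ("fleet-a", "Ready"), ("fleet-b", "Ready")]

metrics = ToyMetrics()
for fleet, state in cluster:
    metrics.record_event(fleet, state)
metrics.refresh_count(cluster)

# A restart is modelled as a brand-new object with empty memory.
restarted = ToyMetrics()
restarted.refresh_count(cluster)
print("count after restart:", restarted.count)
print("total after restart:", restarted.total)
```

In this model the gauge reappears right after a restart, while the counter stays empty until new events arrive. Whether that matches the behaviour reported here is exactly what would need to be verified against the real code.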
In my specific scenario, the controller metrics had already malfunctioned before the restart:
- Start the K8S controller - Agones Controller.
- Create "shipping-mode1-map1-3568" fleet
- Check the metrics for the "shipping-mode1-map1-3568" battle server.
- Observe that the "agones_gameservers_total" metric is being collected, but "agones_gameservers_count" metric is not.
That's a good point - will have to attempt to replicate 🤔
In our use case, Agones is configured with 10 fleets, and each fleet has a fleet autoscaler enabled. Additionally, 10 separate gameservers have been configured, which are not managed by the fleets.
Hope this can help you successfully reproduce the issue. Thank you for your hard work.
Hello, markmandel,
I hope this message finds you well. I wanted to follow up on the issue. Furthermore, I understand that replicating the issue can sometimes be challenging, and I'm wondering if there's any additional information or assistance I can provide to facilitate the process.
If any specific scenarios, logs, or system configurations would be helpful, please let me know. I’m also willing to assist with testing or any other tasks that might help you address the issue more efficiently.
Looking forward to your guidance on how I can best support your efforts. Thank you for your time and attention to this matter.
Best regards
Sorry this isn't currently at the top of my priority queue, so haven't had a chance to look at it. Would definitely be happy to provide pointers if you wanted to dig into it?
Thank you for your prompt response. I completely understand that this issue may not be your top priority at the moment. I appreciate your willingness to provide pointers for further investigation.
If there's a more suitable time for you to delve into this matter or if you have any initial thoughts to share, I would be grateful for any guidance you can provide.
Looking forward to your insights.
If you would like to go digging (and I encourage it!), all these metrics are managed here: https://github.com/googleforgames/agones/tree/main/pkg/metrics
Feel free to drop questions here, or in #development channel on our Slack!
We have replicated this issue locally, and the agones_gameservers_total metric is missing after restarting the agones-controller.
Before restart:
After:
In Agones version 1.35.0, disabling the FeatureGate: "ResetMetricsOnDelete" can resolve issues with metrics anomalies.
Through an in-depth analysis of the source code, I've discovered that this feature can lead to certain memory optimization benefits. However, it also results in an increase in code complexity. Notably, during this optimization process, there seems to be a bug within the code that causes anomalies in the metrics indicators.
Based on these findings, I will attempt to fix this issue and provide a pull request (PR) if everything goes smoothly.
Thanks for digging in!