Metrics data loss in K8S controller

Open alvin-7 opened this issue 1 year ago • 8 comments

What happened: After restarting the K8S controller, the "agones_gameservers_total" metric is no longer being collected for the "shipping-mode1-map1-3568" battle server. However, the "agones_gameservers_count" metric is still being collected.

What you expected to happen: I expected both the "agones_gameservers_total" and "agones_gameservers_count" metrics to continue being collected consistently, even after the controller restart.

  • max by(type) (agones_gameservers_total{fleet_name="shipping-mode1-map1-3568"}) [image]
  • avg by(type) (agones_gameservers_count{fleet_name="shipping-mode1-map1-3568"}) [image]

How to reproduce it (as minimally and precisely as possible):

  1. Start the Kubernetes controller (the Agones controller).
  2. Create the "shipping-mode1-map1-3568" fleet.
  3. Check the metrics for the "shipping-mode1-map1-3568" battle server.
  4. Observe that the "agones_gameservers_total" metric is being collected, but "agones_gameservers_count" metric is not.
  5. Restart the controller.
  6. Check the metrics for the "shipping-mode1-map1-3568" battle server after the restart.
  7. Notice that the "agones_gameservers_total" metric is no longer being collected, while the "agones_gameservers_count" metric is still being collected.
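
The "check the metrics" steps above can be scripted. The following is a minimal sketch, assuming the controller's metrics are available in the Prometheus text exposition format and that each sample carries a fleet_name label, as in the queries above. The helper name metricsForFleet and the sample scrape output are hypothetical and only illustrate the check.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricsForFleet scans Prometheus text-format output and reports, for each
// metric name, whether at least one sample is labelled with the given fleet.
func metricsForFleet(exposition, fleet string, names []string) map[string]bool {
	found := make(map[string]bool, len(names))
	for _, n := range names {
		found[n] = false
	}
	// Builds the label matcher, e.g. fleet_name="shipping-mode1-map1-3568".
	label := fmt.Sprintf("fleet_name=%q", fleet)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		for _, n := range names {
			// A labelled sample line starts with the metric name followed by '{'.
			if strings.HasPrefix(line, n+"{") && strings.Contains(line, label) {
				found[n] = true
			}
		}
	}
	return found
}

func main() {
	// Hypothetical scrape output, shaped like the state reported after the restart.
	sample := `# TYPE agones_gameservers_count gauge
agones_gameservers_count{fleet_name="shipping-mode1-map1-3568",type="Ready"} 3
agones_gameservers_total{fleet_name="other-fleet",type="Ready"} 7
`
	got := metricsForFleet(sample, "shipping-mode1-map1-3568",
		[]string{"agones_gameservers_total", "agones_gameservers_count"})
	fmt.Println(got["agones_gameservers_total"], got["agones_gameservers_count"])
}
```

Running the same check before and after the controller restart would show which of the two metrics disappears for the fleet.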

Anything else we need to know?:

  1. In the cluster, there are a total of 10 fleets, and there is a continuous process of deleting existing fleets and creating new fleets. This dynamic fleet activity might have an impact on the metrics data collection.

Environment:

  • Agones version: 1.35.0

  • Kubernetes version (use kubectl version): Client Version: v1.27.2; Kustomize Version: v5.0.1; Server Version: v1.22.5-tke.19

  • Cloud provider or hardware configuration:

  • Install method (yaml/helm): helm

  • Troubleshooting guide log(s):

  • Others:

alvin-7 avatar Jan 23 '24 08:01 alvin-7

This sounds like it is working as intended.

  1. If a fleet is deleted and we restart the controller, we can't create the old metrics - it's all in memory.
  2. If a fleet is deleted we specifically remove it from all metrics reporting to ensure a memory leak / metric explosion doesn't happen (we have to do a full reset to do it).

See https://github.com/googleforgames/agones/issues/2478 for context.
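
The two points above can be illustrated with a small sketch. It is not the Agones implementation: the registry type and its methods are hypothetical, and it removes series per fleet for brevity, whereas the comment above says the real controller has to do a full reset. It only shows why in-memory series vanish on restart and why series for deleted fleets are dropped to keep the series count bounded.

```go
package main

import "fmt"

// seriesKey identifies one time series: a fleet plus a game server state.
type seriesKey struct {
	fleet, state string
}

// registry is a hypothetical stand-in for the controller's metric state.
// Everything lives in process memory; nothing is persisted.
type registry struct {
	total map[seriesKey]int64
}

func newRegistry() *registry {
	return &registry{total: make(map[seriesKey]int64)}
}

// record increments the counter for a fleet/state pair.
func (r *registry) record(fleet, state string) {
	r.total[seriesKey{fleet, state}]++
}

// deleteFleet drops every series belonging to the fleet, so that series for
// deleted fleets do not accumulate without bound.
func (r *registry) deleteFleet(fleet string) {
	for k := range r.total {
		if k.fleet == fleet {
			delete(r.total, k)
		}
	}
}

// seriesCount returns the number of series currently held in memory.
func (r *registry) seriesCount() int { return len(r.total) }

func main() {
	r := newRegistry()
	r.record("fleet-a", "Ready")
	r.record("fleet-a", "Allocated")
	r.record("fleet-b", "Ready")
	fmt.Println(r.seriesCount()) // 3

	r.deleteFleet("fleet-a")
	fmt.Println(r.seriesCount()) // 1

	// A controller restart is equivalent to starting from an empty registry.
	r = newRegistry()
	fmt.Println(r.seriesCount()) // 0
}
```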

markmandel avatar Jan 24 '24 01:01 markmandel

In my specific scenario, the controller metrics had already malfunctioned before the restart:

  1. Start the Kubernetes controller (the Agones controller).
  2. Create the "shipping-mode1-map1-3568" fleet.
  3. Check the metrics for the "shipping-mode1-map1-3568" battle server.
  4. Observe that the "agones_gameservers_total" metric is being collected, but "agones_gameservers_count" metric is not.

alvin-7 avatar Jan 24 '24 02:01 alvin-7

That's a good point - will have to attempt to replicate 🤔

markmandel avatar Jan 24 '24 05:01 markmandel

In our use case, Agones is configured with 10 fleets, and each fleet has a fleet autoscaler enabled. Additionally, 10 separate gameservers have been configured, which are not managed by the fleets.

Hope this can help you successfully reproduce the issue. Thank you for your hard work.

alvin-7 avatar Jan 25 '24 06:01 alvin-7

Hello, markmandel,

I hope this message finds you well. I wanted to follow up on this issue. I understand that replicating it can be challenging, so I'm wondering whether there is any additional information or assistance I can provide to help.

If any specific scenarios, logs, or system configurations would be helpful, please let me know. I’m also willing to assist with testing or any other tasks that might help you address the issue more efficiently.

Looking forward to your guidance on how I can best support your efforts. Thank you for your time and attention to this matter.

Best regards

alvin-7 avatar Jan 31 '24 09:01 alvin-7

Sorry this isn't currently at the top of my priority queue, so haven't had a chance to look at it. Would definitely be happy to provide pointers if you wanted to dig into it?

markmandel avatar Feb 01 '24 03:02 markmandel

Thank you for your prompt response. I completely understand that this issue may not be your top priority at the moment. I appreciate your willingness to provide pointers for further investigation.

If there's a more suitable time for you to delve into this matter or if you have any initial thoughts to share, I would be grateful for any guidance you can provide.

Looking forward to your insights.

alvin-7 avatar Feb 02 '24 05:02 alvin-7

If you would like to go digging (and I encourage it!), all these metrics are managed here: https://github.com/googleforgames/agones/tree/main/pkg/metrics

Feel free to drop questions here, or in #development channel on our Slack!

markmandel avatar Feb 06 '24 06:02 markmandel

We have replicated this issue locally: the agones_gameservers_total metric is missing after restarting the agones-controller.

Before restart: Screenshot 2024-02-20 at 3 05 07 PM

After: Screenshot 2024-02-20 at 3 05 47 PM

Kalaiselvi84 avatar Feb 20 '24 22:02 Kalaiselvi84

In Agones version 1.35.0, disabling the "ResetMetricsOnDelete" feature gate can resolve the metrics anomalies.
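
For reference, here is a minimal sketch of how such a gate setting could be expressed and parsed. It assumes feature gates are supplied as a URL-query-encoded string such as "ResetMetricsOnDelete=false", which is my understanding of the Agones configuration format rather than something stated in this thread. The function parseGates is hypothetical and is not the Agones API.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseGates parses a string such as "ResetMetricsOnDelete=false&Other=true"
// into a map of gate name to enabled state. Hypothetical helper for illustration.
func parseGates(s string) (map[string]bool, error) {
	values, err := url.ParseQuery(s)
	if err != nil {
		return nil, err
	}
	gates := make(map[string]bool, len(values))
	for name, v := range values {
		// If a gate is repeated, the last value wins.
		enabled, err := strconv.ParseBool(v[len(v)-1])
		if err != nil {
			return nil, fmt.Errorf("gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseGates("ResetMetricsOnDelete=false")
	if err != nil {
		panic(err)
	}
	fmt.Println(gates["ResetMetricsOnDelete"]) // false
}
```

With the gate disabled, metric series for deleted fleets are presumably no longer reset, so the trade-off is the memory growth that the feature was introduced to prevent.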

After analyzing the source code, I found that this feature saves memory but also adds code complexity. The optimization appears to contain a bug that causes the anomalies in the metrics.

Based on these findings, I will attempt to fix this issue and provide a pull request (PR) if everything goes smoothly.

alvin-7 avatar Mar 06 '24 07:03 alvin-7

Thanks for digging in!

markmandel avatar Mar 06 '24 09:03 markmandel