client_python Remove gauge metric

Hi all, I have written a custom exporter that calculates the cost of running a node in AWS. Here is how it works: let's assume I have 3 nodes, each costing $5 per hour. When I plot sum(cost_metric{}) in Grafana, I get $15 (3 x $5). Let's say that after 2 hours one of the nodes gets deleted (autoscaling). In that case the total cost should drop to $10.

The problem is that the metric is preserved: even though the node has been deleted, its cost is still exported, so the total shows $15 instead of dropping to $10.

How would I go about solving that problem?

cost_metric = Gauge(
    "cost_metric",
    "Cost of running an instance for 1 hour",
    ["node_name", "instance_type"],
)
...

    node_names = get_nodes()
    for node_name in node_names:
        node_info = get_node_info(node_name)
        if node_info is None:
            continue

        logging.info(f"Updating metrics for node: {node_name}")

        # labels section
        labels = node_info["metadata"]["labels"]
        instance_type = labels.get("beta.kubernetes.io/instance-type", "unknown")
        cost = get_cost_of_instance(instance_type)

        if cost is not None:
            cost_metric.labels(node_name=node_name, instance_type=instance_type).set(cost)

I tried

I collected the previous and current nodes in a dict and then tried to remove the ones that no longer exist. The issue is that this call fails:

cost_metric.remove(node_name=node_name, instance_type=instance_type)

Traceback (most recent call last):
  File "/home/XXXX/projects/main.py", line 154, in <module>
    previous_nodes = update_metrics(previous_nodes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eksohio/projects/main.py", line 131, in update_metrics
    cost.remove(node_name=node_name, instance_type=instance_type)
TypeError: MetricWrapperBase.remove() got an unexpected keyword argument 'node_name'
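
The TypeError above is consistent with how remove() is declared in client_python: it takes the label values positionally, in the same order as the label names given to Gauge(...), not as keyword arguments. A minimal sketch of the previous/current bookkeeping described above, assuming each node is tracked as a (node_name, instance_type) tuple; remove_stale_series is a hypothetical helper name, and gauge is expected to be the cost_metric Gauge:

```python
def remove_stale_series(gauge, previous, current):
    """Drop label sets that were exported last cycle but no longer exist.

    gauge    -- assumed to be a prometheus_client Gauge declared with
                labels ["node_name", "instance_type"]
    previous -- set of (node_name, instance_type) tuples from the last update
    current  -- set of (node_name, instance_type) tuples from this update
    Returns the set to remember as `previous` for the next cycle.
    """
    for node_name, instance_type in sorted(previous - current):
        # Label values are passed positionally, in label-name order.
        gauge.remove(node_name, instance_type)
    return set(current)
```

Called once per update cycle, e.g. previous_nodes = remove_stale_series(cost_metric, previous_nodes, current_nodes), this should stop deleted nodes from contributing to sum(cost_metric{}).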

Aug 07 '24 09:08 danielstankw

I can set the cost to 0 and bypass it that way, but the metric that is no longer needed would still be preserved and, over time, consume space. :/

Aug 07 '24 13:08 danielstankw

Hello, this sounds like the use case for a custom collector: https://prometheus.github.io/client_python/collector/custom/. You will only add metrics for the nodes that you want to include in the output so no extra series will be left around.

Aug 22 '24 18:08 csmarchbanks

@csmarchbanks thanks for the hint. Would you be able to elaborate a bit more on how that would work?

Aug 26 '24 07:08 danielstankw

That would work by running your get_node and other logic during each scrape and only having cost_metric exist for the lifetime of the scrape. That way if an instance disappears it will automatically just not appear during the next scrape's output. Adapting the example a bit for your case (I have not run/tested this but it should give the idea):

from prometheus_client.core import GaugeMetricFamily, REGISTRY
from prometheus_client.registry import Collector

class CustomCollector(Collector):
    def collect(self):
        # Built fresh on every scrape, so deleted nodes leave no stale series.
        cost_metric = GaugeMetricFamily(
            "cost_metric",
            "Cost of running an instance for 1 hour",
            labels=["node_name", "instance_type"],
        )

        node_names = get_nodes()
        for node_name in node_names:
            node_info = get_node_info(node_name)
            # ... collect label info, etc... from your code.
            if cost is not None:
                # Metric families have no .labels(); add_metric takes the
                # label values in the same order as the label names above.
                cost_metric.add_metric([node_name, instance_type], cost)

        yield cost_metric

REGISTRY.register(CustomCollector())

Aug 30 '24 19:08 csmarchbanks

@csmarchbanks I will test it out, thanks a ton for taking your time and providing an example :)

Aug 31 '24 20:08 danielstankw

@csmarchbanks I have implemented an exporter as suggested. The issue I am facing now is that, because I expose the metric every 10 minutes, I can't see it in the Grafana dashboard or in Prometheus at all times, only at specific intervals.

What I mean is this: the metric is available for 5 minutes, followed by a 10 minute break. So if I query the metric in the gap between when it was last available and the next scrape, I get an empty dashboard. I would like to see the metric at all times, but it should update automatically when, e.g., a node gets deleted.

Nov 01 '24 13:11 danielstankw

same problem, any suggestion?

Nov 14 '24 06:11 junneyang
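
One possible explanation for the gaps, offered as an assumption rather than a diagnosis: by default Prometheus only looks back 5 minutes for the most recent sample when evaluating a query, so a series scraped every 10 minutes would be visible for about 5 minutes after each scrape and then vanish until the next one. If that is the cause, scraping more often (for example every minute) while refreshing the expensive node and price lookups only every 10 minutes may keep the series continuous. A minimal sketch of a time-based cache that collect() could read from; CachedValue, fetch, and ttl_seconds are hypothetical names, not part of client_python:

```python
import threading
import time


class CachedValue:
    """Caches the result of fetch() and refreshes it at most every ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable doing the expensive lookup
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._lock = threading.Lock()  # scrapes may arrive concurrently
        self._value = None
        self._fetched_at = None

    def get(self):
        with self._lock:
            now = self._clock()
            expired = (
                self._fetched_at is None
                or now - self._fetched_at >= self._ttl
            )
            if expired:
                # Refresh only when the cached data is older than the TTL.
                self._value = self._fetch()
                self._fetched_at = now
            return self._value
```

Inside the custom collector, collect() would call get() on every scrape and emit one sample per cached node, so a deleted node drops out at the next refresh while the remaining series stay visible between refreshes.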