pelorus icon indicating copy to clipboard operation
pelorus copied to clipboard

Prometheus rework

Open mateusoliveira43 opened this issue 1 year ago • 3 comments

Fix

After #939, team members have agreed in the following:

  • remove deploy active metric
  • add clean up to metrics endpoint and increase scrapping time interval (no need to be every 15s)
  • add time window to exporters (only accept data with timestamp as value that is higher than exporter instance creation, but it not more than a threshold) (except committime exporter, because it only collects commit after a deployment)
  • explain this new behavior in the docs to user

Requirements

  • [x] remove deploy active metric
  • [ ] OPTIONAL Test if it is possible to deploytime metric do not use exactly timestamp (prometheus would use timestamp it received (because of ~3horus receive problem) and not use exactly timestamp as value)
  • [ ] add clean up to deploytime, committime and failure exporter
  • [ ] add clean up to webhook exporter (seems harder than on the old exporters)
  • [ ] add time window to exporters (except committime exporter)
  • [ ] explain this new behavior in the docs to user

mateusoliveira43 avatar Apr 28 '23 19:04 mateusoliveira43

this does in fact seem to be a nice written representation of what we spoke about.

  • add time window to exporters (only accept data with timestamp as value that is higher than exporter instance creation, but it not more than a threshold) (except committime exporter, because it only collects commit after a deployment)
  • explain this new behavior in the docs to user

I don't recall discussing these two, but seems like the right move to me.

  • remove deploy active metric
  • add clean up to metrics endpoint and increase scrapping time interval (no need to be every 15s)

weshayutin avatar May 01 '23 16:05 weshayutin

  • [ ] remove deploy active metric
  • This is resolved by https://github.com/dora-metrics/pelorus/pull/943
  • [ ] OPTIONAL Test if it is possible to deploytime metric do not use exactly timestamp (prometheus would use timestamp it received (because of ~3horus receive problem) and use exactly timestamp as value)
  • agree, additional tests:
    • check scenario where we deploy test application let's call it TestApp
      • deploy TestApp it 10 times every 30min (can be a simple shell script with sleep)
      • wait 2h before going into grafana
      • After that test number of deploymets over the following periods:
        • last 2 hours, should give 0
        • previous 2h (so 4h ago -> 2h ago) should give 4 deployments;
        • last week - should give 10 deployments.
    • check scenario where we create 10 deployments using webhook exporter and then restart webhook exporter (just delete running webhook instance).
      • wait 10 min
      • check how many deployments happened in the last week
      • check how many deployments happened in the last 5 minutes
  • [ ] add clean up to deploytime, committime and failure exporter
  • This I think is't really required as the scrape always get the new Gauge object, so we could have this for free. Just need to check if the caching in committime requires some cleanup.
  • [ ] add clean up to webhook exporter (seems harder than on the old exporters)
  • Agree. There should be a lifetime of metric after which it is removed. The way I see it:
    • Collect metrics not older than X (as implemented in #943)
    • Add function to check if the time of event where it happened is not greater then Y and remove it if it's true.
  • [ ] add time window to exporters (except committime exporter)
  • As @weshayutin reworded above.
  • [ ] explain this new behavior in the docs to user
  • Agree.

mpryc avatar May 02 '23 07:05 mpryc

RE: cleanup functionality

I think we should create a new class for the metrics that inherits from GaugeMetricFamily, let's call it PelorusGaugeMetricFamily.

This class should have at least three additional methods:

def clear_all_metrics(self) -> int:
"""
Clears out all the in memory metrics from the GaugeMetricFamily.

Returns:
  - number of metrics removed
"""

def clear_metrics(self, threshold_sec: int) -> int:
"""
Clears out the in memory metrics from the GaugeMetricFamily, which are
above certain threshold 

Arg:
  - threshold_sec (int): The time threshold in seconds. All metrics that occurred `threshold_sec` seconds ago or earlier will be removed.

Returns:
  - number of metrics removed.
"""

def remove_metric(self, labels: Sequence[str]) -> int:
"""
Remove particular metric from the GaugeMetricFamily.
Reverse function to GaugeMetricFamily.add_metric().

Arg:
  - labels Sequence[str]: The labels representing metric to be removed.

Returns:
  - number of metrics removed.
"""

mpryc avatar May 02 '23 09:05 mpryc