
Counter metric is reset when it has no labels

Open alex-wearisma opened this issue 2 years ago • 6 comments

Hello,

I implemented a custom MongoDB metric data store. While testing it, I noticed that counters with no labels are reset every time I relaunch my application. This does not happen to labeled counters.

After inspecting the code, it looks like this line resets the counters: https://github.com/prometheus/client_ruby/blob/8740786a4b46f46c1c3ad893a699c3ce51d468b9/lib/prometheus/client/metric.rb#L46

Off the top of my head, I can't think of a scenario where this behaviour would be useful for counters or other metric types. Also, based on the tests, the store should handle nil cases and initialise values to 0.0 before incrementing: https://github.com/prometheus/client_ruby/blob/8740786a4b46f46c1c3ad893a699c3ce51d468b9/spec/examples/data_store_example.rb#L33

Would it be safe to remove it?

Thank you, Alex

alex-wearisma avatar Apr 28 '22 15:04 alex-wearisma

Update

After further investigation, I see the reason why it was done: https://github.com/prometheus/client_ruby/commit/feccff3429cac056913ad7414b8744e61db57d85

I think it makes sense, and maybe it would be nice to present default values for labeled metrics too!

Though I think setting default values during the initialisation step breaks persistent metric storages. What if, instead of setting default values during metric creation, we returned default values at presentation time? It could work like this:

  1. Prometheus hits /metrics endpoint
  2. Exporter builds default values from metrics registry
  3. Exporter pulls metrics from the data store using all_values method
  4. Exporter updates default values with values pulled from data store
  5. Marshal formats text output and presents it to Prometheus.

Wouldn't this approach solve the missing-values issue as well as make persistent storages work? I've sketched roughly what I mean below.
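
Purely as an illustration of steps 2-4 (these aren't real client_ruby internals, just the shape of the idea; metric.values is the existing method that delegates to the store's all_values):

  # Illustrative only: start from per-metric defaults and overlay whatever
  # the data store currently holds.
  def values_for_export(metric)
    # Step 2: the default (zero) value for the metric's unlabelled series.
    defaults = { {} => 0.0 }

    # Step 3: values pulled from the data store (may be empty after a restart).
    stored = metric.values

    # Step 4: stored values win; series missing from the store fall back to defaults.
    defaults.merge(stored)
  end

The text formatter would then marshal the merged hashes instead of the raw store contents.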

🤔

alex-wearisma avatar Apr 28 '22 16:04 alex-wearisma

Hello Alex,

The reason we initialize metrics that don’t have labels is that it’s recommended by Prometheus’s best practices:

There SHOULD be a way to initialize a given Child with the default value, usually just calling labels(). Metrics without labels MUST always be initialized to avoid problems with missing metrics.

maybe it would be nice to present default values for labeled metrics too!

The problem here is that while the library knows what the labels will be for your metric (you specify these when declaring it), we can’t know what values those labels may have. As a consumer of the library, however, you might know what the values will be in advance (it depends on the metric). If you do, it’s recommended that you define them up front using the init_label_set method.

If you do that, all your metrics will be reported constantly, instead of appearing and disappearing after restarts.
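
To make that concrete, here's a minimal sketch (the metric names and queue values are just made up for the example):

  require 'prometheus/client'

  registry = Prometheus::Client.registry

  # A counter with no labels is initialised to 0 when it's declared,
  # so it shows up on the very first scrape after a restart.
  restarts = Prometheus::Client::Counter.new(
    :app_restarts_total,
    docstring: 'Number of times the app has booted'
  )
  registry.register(restarts)

  # For a labelled counter the library knows the label names up front,
  # but only you can know the values. If you do know them, initialise
  # each label set so those series are also exported as 0 straight away.
  jobs = Prometheus::Client::Counter.new(
    :jobs_processed_total,
    docstring: 'Background jobs processed',
    labels: [:queue]
  )
  registry.register(jobs)

  %w[default mailers billing].each do |queue|
    jobs.init_label_set(queue: queue)
  end

  # Later increments simply update the already-present series.
  jobs.increment(labels: { queue: 'default' })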

counters with no labels are reset every time I relaunch my application. setting default values during the initialisation step, break persistent metric storages.

This is by design. Prometheus metrics are expected to reset after an application restart, and Prometheus is well equipped to deal with that.

We do not support “persistent” metric storage. To the extent that your metrics are maintained after a restart, that’s an unfortunate side effect of our internal storage method, which happens to use files on disk to share metrics between processes.

We would reset these files on restart if we could, but we haven’t figured out how to do so safely, in a way that’d support all possible use cases. Instead, we recommend that you clean these files up yourself on app start, and we provide some sample code on how to do this in our README.
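
For reference, that cleanup is just deleting DirectFileStore's files before any process records a metric, along these lines (this assumes the store is pointed at ./tmp/prometheus; see the README for the exact snippet):

  require 'prometheus/client'
  require 'prometheus/client/data_stores/direct_file_store'

  # Run this once, early in boot, before any metrics are recorded
  # (e.g. before worker processes are forked).
  Dir.glob('./tmp/prometheus/*.bin').each do |file_path|
    File.unlink(file_path)
  end

  Prometheus::Client.config.data_store =
    Prometheus::Client::DataStores::DirectFileStore.new(dir: './tmp/prometheus')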

dmagliola avatar Apr 30 '22 07:04 dmagliola

I’m assuming you’re using DirectFileStore, since you mention persistent metrics.

Here are a number of things to keep in mind when using it

dmagliola avatar Apr 30 '22 07:04 dmagliola

Thank you for your response, Daniel! This explains a lot about the client and its intended use.

What do you think would be the best approach for the following case. We host an app in Heroku with the following setup:

  • 1x web dyno
  • 3x worker dynos (all workers work on the same task, and have the same metrics)

We need to collect the worker dynos' metrics, and since the workers are separate virtual machines, they cannot share metrics via files. We have two options here:

  1. Use pushgateway
  2. Implement a centralised metric store which is updated by the workers and exposed by the web dyno via a /metrics endpoint

We implemented the first option and it works well. At the moment we are exploring the second one. We implemented a custom data store that keeps metrics in MongoDB. The problem we are facing now is that whenever we scale our workers, a new dyno is launched and metrics are reset, so we could lose metrics between Prometheus scrapes:

     worker records     worker records
        metrics            metrics
         | | |              | | |
   |-----------------|-------------------|
 first             heroku             second
 prometheus      scales and          prometheus
 scrape         resets metrics         scrape
   (-----------------)
      these metrics
        will be
         lost

We have the following options:

  1. As mentioned before, present Prometheus with default metric values when it collects metrics from the /metrics endpoint (i.e. do not reset the metric when the metric object is initialised)
  2. Add host_id + process_id labels to the metric, and aggregate values in the data store's all_values method (the problem with this approach is that we don't know when dynos are cycled or scaled, i.e. when new host_id + process_id pairs are introduced and old ones are retired, so we don't know which host_id + process_id series to include in the all_values aggregation)

Thank you for your thoughts!

alex-wearisma avatar Apr 30 '22 14:04 alex-wearisma

Hi @alex-wearisma, sorry for the silence.

Truth be told, I've skimmed over your situation a few times and I've not had any amazing ideas.

Running Prometheus in environments where you can't individually make HTTP requests to each process that you want to scrape metrics from is always a pain. DirectFileStore solves the case where those processes are all forks of a single parent, but in many ways that's one of the easier cases to solve (not to diminish the hard work @dmagliola did to make that work!).

I've not deployed anything beyond a hobby app on Heroku, so correct me if my understanding is wrong (it has been pieced together from a few searches). My understanding is:

  • It's possible to talk to individual dynos, though getting their IPs is a pain (it seems like you'd need to do service discovery by running the heroku CLI to get the IPs and then feeding them into Prometheus's file-based service discovery by writing out a file periodically; there's a sketch of that after this list)
  • The same is not true for workers, which aren't addressable over the public internet (less sure if this is true)
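
Purely to illustrate that first bullet, a hypothetical glue script (the heroku ps parsing, the internal hostnames and the exporter port are all assumptions; adapt to whatever actually gives you reachable addresses):

  require 'json'
  require 'open3'

  # Ask the Heroku CLI which dynos exist, then write them out in Prometheus's
  # file-based service discovery format. Run periodically (cron or similar).
  stdout, _status = Open3.capture2('heroku', 'ps', '--json', '--app', 'my-app')
  dynos = JSON.parse(stdout)

  # Hypothetical mapping from a dyno to a scrape target. Turning a dyno name
  # into a reachable host:port is the hard part and depends on your setup.
  targets = dynos
    .select { |dyno| dyno['type'] == 'web' }
    .map    { |dyno| "#{dyno['name']}.my-app.internal:9394" }

  File.write(
    '/etc/prometheus/file_sd/heroku.json',
    JSON.pretty_generate([{ 'targets' => targets, 'labels' => { 'app' => 'my-app' } }])
  )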

Assuming you want to continue with Prometheus, I think this leaves you in heavily custom territory. What I'm about to suggest is very much in "here's what I might try" territory, rather than something I'm sure will work well.

I think your second option might be workable, but you'd probably need to add a per-host/process heartbeat into the storage, and only return metrics for that host/process if its heartbeat has been updated within some period (e.g. the last 5m). You'd want to reserve a label for that purpose which you know nothing else will use.

The other thing you'll need to think about (assuming these are being saved to durable storage) is cleaning out any metrics from old processes that haven't sent a heartbeat in a really long time (e.g. 12-24h).
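
Very much in that "here's what I might try" spirit, here's a rough sketch of the heartbeat side (the class, collection and thresholds are all made up, and this assumes the mongo gem):

  require 'mongo'
  require 'socket'

  # Hypothetical helper: every worker periodically records a heartbeat, and the
  # exporter only aggregates series belonging to processes that are still alive.
  class ProcessHeartbeat
    STALE_AFTER = 5 * 60        # ignore processes silent for more than 5 minutes
    PURGE_AFTER = 24 * 60 * 60  # delete their series after a day

    def initialize(mongo_client)
      @heartbeats = mongo_client[:heartbeats]
      @process_id = "#{Socket.gethostname}-#{Process.pid}"
    end

    # Call this from a background thread in every worker, e.g. every 30 seconds.
    def beat
      @heartbeats.update_one(
        { _id: @process_id },
        { '$set' => { last_seen: Time.now.utc } },
        upsert: true
      )
    end

    # Processes considered alive; your all_values would keep only series whose
    # reserved host/process label is in this list.
    def live_process_ids
      cutoff = Time.now.utc - STALE_AFTER
      @heartbeats.find(last_seen: { '$gte' => cutoff }).map { |doc| doc['_id'] }
    end

    # Processes gone long enough that their series (and heartbeat) can be deleted.
    def expired_process_ids
      cutoff = Time.now.utc - PURGE_AFTER
      @heartbeats.find(last_seen: { '$lt' => cutoff }).map { |doc| doc['_id'] }
    end
  end

A periodic job could then call expired_process_ids and delete those series (and heartbeat documents) from your MongoDB store.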

That's probably not as simple an answer as you were looking for, and it might be too late (I assume you've figured something out in the meantime), but it's all I could really think of after a few rounds of reading through this thread.

Sinjo avatar Jul 10 '22 16:07 Sinjo

Hello @Sinjo. Thank you for your thoughts! And thank you both, @Sinjo and @dmagliola, for your help!

alex-wearisma avatar Aug 10 '22 08:08 alex-wearisma

No problem @alex-wearisma. I'm going to close this issue since it's not something we'll be working on in the core library.

You may be interested in reading this discussion that was opened last week. It's not exactly the same use-case as what you're doing, but you might find some inspiration in what @KaoruDev is working on (assuming it ends up being open source).

Sinjo avatar Aug 21 '22 12:08 Sinjo

Thank you @Sinjo !

alex-wearisma avatar Aug 22 '22 08:08 alex-wearisma