Prometheus: report bytes remaining

Open asomers opened this issue 1 year ago • 4 comments

For recently added jobs that haven't finished a full initial replication, it would be useful to report how much data remains to transfer. For example, if zrepl status shows this:

Progress: [\--------------------------------------------------] 252.6 GiB / 22.3 TiB @ 121.0 MiB/s (2d  5h  1m 28s remaining)
* foo  STEPPING (step 1/25, 16.2 GiB/3.3 TiB) next: full send @zrepl_foo20240211_100807_000 (resumed)
* bar  STEPPING (step 1/23, 236.4 GiB/19.0 TiB) next: full send @zrepl_bar20240211_100807_000 (resumed)

Then it would be useful for the Prometheus exporter to report something like this:

zrepl_replication_bytes_remaining{filesystem="foo",zrepl_job="myjob"} 3.61099e+12
zrepl_replication_bytes_remaining{filesystem="bar",zrepl_job="myjob"} 2.06369e+13
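As a rough check of where those two values come from, here is a minimal sketch that subtracts the replicated bytes from the total shown in each `zrepl status` line and prints the result in the Prometheus text format. The helper names `bytes_remaining` and `exposition_line` are made up for illustration and are not part of zrepl.

```python
GIB = 2**30
TIB = 2**40


def bytes_remaining(replicated_bytes, total_bytes):
    """Bytes still to transfer; clamped at zero in case the estimate is exceeded."""
    return max(total_bytes - replicated_bytes, 0)


def exposition_line(job, filesystem, value):
    """Render one sample in the Prometheus text exposition format."""
    return (
        f'zrepl_replication_bytes_remaining'
        f'{{filesystem="{filesystem}",zrepl_job="{job}"}} {value:.6g}'
    )


# Figures taken from the status output above (hypothetical job name "myjob").
print(exposition_line("myjob", "foo", bytes_remaining(16.2 * GIB, 3.3 * TIB)))
print(exposition_line("myjob", "bar", bytes_remaining(236.4 * GIB, 19.0 * TIB)))
```

The units in the status output are binary (GiB, TiB), which is why 3.3 TiB minus 16.2 GiB comes out near 3.61e12 bytes rather than 3.28e12.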

asomers avatar Mar 20 '24 14:03 asomers

Sorry for the late reply.

The current architecture rebuilds all replication state from ground truth on every replication attempt.

There is no long-lived concept of a filesystem within a job.

Hence, it is difficult to attach a metric to a filesystem.

I guess after an attempt finishes we could keep its gauge objects around until the next attempt.
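To make that idea concrete, here is a minimal sketch of a store that keeps the per-filesystem values from the most recent attempt and only discards them when the next attempt for the same job begins. It is written in Python purely for illustration; the class `AttemptGauges` and its methods are hypothetical and do not reflect zrepl's actual code.

```python
class AttemptGauges:
    """Holds bytes-remaining values per (job, filesystem) across attempts."""

    def __init__(self):
        self._values = {}  # (job, filesystem) -> bytes remaining

    def begin_attempt(self, job):
        # A new attempt rebuilds state from scratch, so drop only this
        # job's values from the previous attempt.
        self._values = {k: v for k, v in self._values.items() if k[0] != job}

    def set(self, job, filesystem, remaining):
        self._values[(job, filesystem)] = remaining

    def collect(self):
        # Values stay visible to scrapes after an attempt finishes,
        # until begin_attempt() is called again for that job.
        return dict(self._values)


gauges = AttemptGauges()
gauges.begin_attempt("myjob")
gauges.set("myjob", "foo", 3_610_993_754_112)
# ... the attempt finishes here; a scrape still sees the last value.
print(gauges.collect())
```

One consequence of this design is that a filesystem removed from the job disappears from the metrics only at the start of the next attempt.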

I think it would help me to understand how you would use these metrics in a dashboard or alerting rule.

problame avatar Oct 27 '24 23:10 problame

We would use the metrics to see total zrepl throughput, and we would also use them to alert on hung jobs.

asomers avatar Oct 27 '24 23:10 asomers

So, what's the alert rule, precisely? You proposed a gauge, and these are notoriously annoying to write robust alert rules for.

problame avatar Oct 27 '24 23:10 problame

Here's an example of a similar alert for a stalled transfer in a different application. It combines metrics from two exporters, raising an alert for any transfer that's been stalled for a whole day. I've noticed that zrepl sometimes hangs, so it would be very useful to have such an alert.

zfs_dataset_used_bytes{dataset=~SOME_PATTERN} unless on(dataset) last_over_time(some_exporter_bytes_total[24h])
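The stall condition itself can be sketched outside of PromQL. The following assumes a series of `(unix_seconds, bytes_remaining)` samples for one filesystem, sorted by time, and treats the transfer as stalled when the value is nonzero and has not changed for at least the window. The function name `is_stalled` and the sample layout are assumptions for illustration, not something zrepl or Prometheus provides.

```python
DAY_S = 24 * 3600


def is_stalled(samples, now, window_s=DAY_S):
    """samples: list of (unix_seconds, bytes_remaining), sorted by time."""
    if not samples:
        return False
    latest = samples[-1][1]
    if latest <= 0:
        return False  # nothing left to send, so not a stall
    # Walk backwards to find when the current value was first observed.
    unchanged_since = samples[-1][0]
    for t, v in reversed(samples):
        if v != latest:
            break
        unchanged_since = t
    return now - unchanged_since >= window_s


# Hourly samples that never move for 25 hours count as stalled.
flat = [(h * 3600, 500) for h in range(26)]
print(is_stalled(flat, now=25 * 3600))
```

A gauge-based rule like this has to exclude the zero case explicitly, which is part of why gauges tend to be harder to alert on than counters.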

Also, we've noticed performance problems with zrepl (see https://github.com/zrepl/zrepl/discussions/775 ). A metric like the one I proposed would help us to evaluate any speed improvements we can make. Even if the exporter doesn't publish metrics for inactive jobs, we could still gain insights into zrepl's performance from it.
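As an illustration of the performance use case, average throughput can be estimated from how fast the proposed gauge falls between two samples. This is a sketch under the assumption that samples are `(unix_seconds, bytes_remaining)` pairs sorted by time; `throughput_bytes_per_s` is a hypothetical helper.

```python
def throughput_bytes_per_s(samples):
    """Average rate at which bytes_remaining decreased over the samples."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    # The gauge can rise when new snapshots are added to the plan;
    # clamp so that case does not show up as negative throughput.
    return max(r0 - r1, 0) / (t1 - t0)


print(throughput_bytes_per_s([(0, 1000), (10, 400)]))  # 60.0
```

This only works while a job is active, which matches the point above that metrics for inactive jobs are not required for this purpose.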

asomers avatar Oct 28 '24 01:10 asomers