prom-client icon indicating copy to clipboard operation
prom-client copied to clipboard

Cluster should always aggregate metrics in a same order

Open matej21 opened this issue 2 years ago • 2 comments

The issue with the current metrics aggregation is that it occurs in the order in which individual workers respond to the GET_METRICS_REQ message. This may cause variations in the aggregation results when the metrics contain floats, such as a histogram.

For example, if workers respond to the first request with values 0.5848208, 0.5479198, 0.3437699 (which sum to 1.4765105), and then respond to a second request with the same values but in a different order (0.3437699+0.5848208+0.5479198), the result of the aggregation will be1.4765104999999998 in JS, due to float errors.

This can trigger a "reset" detection in Prometheus and severely impact the accuracy of the graphs.

matej21 avatar Feb 03 '23 11:02 matej21

This could perhaps be fixed by delaying:

			message.metrics.forEach(metric => request.responses.push(metric));
			request.pending--;

and instead do a Promise.all() against all n results an concatenates the results after they are all received instead of as they arrive.

jdmarshall avatar Jul 05 '25 08:07 jdmarshall

I have a PR cooking for Worker support and I’m worried about propagating this error. I think maybe changing the logic to return a promise per worker might be better.

Paging @zbjornson , @SimenB to chime in.

jdmarshall avatar Jul 13 '25 14:07 jdmarshall