Cluster should always aggregate metrics in a same order
The issue with the current metrics aggregation is that it occurs in the order in which individual workers respond to the GET_METRICS_REQ message. This may cause variations in the aggregation results when the metrics contain floats, such as a histogram.
For example, if workers respond to the first request with values 0.5848208, 0.5479198, 0.3437699 (which sum to 1.4765105), and then respond to a second request with the same values but in a different order (0.3437699+0.5848208+0.5479198), the result of the aggregation will be1.4765104999999998 in JS, due to float errors.
This can trigger a "reset" detection in Prometheus and severely impact the accuracy of the graphs.
This could perhaps be fixed by delaying:
message.metrics.forEach(metric => request.responses.push(metric));
request.pending--;
and instead do a Promise.all() against all n results an concatenates the results after they are all received instead of as they arrive.
I have a PR cooking for Worker support and I’m worried about propagating this error. I think maybe changing the logic to return a promise per worker might be better.
Paging @zbjornson , @SimenB to chime in.