encode_opentelemetry: add cut off for otel payloads for prometheus mimir

Open cosmo0920 opened this issue 1 year ago • 7 comments

This issue is reported in https://github.com/fluent/fluent-bit/issues/9400.

This is because Prometheus mimir limits the metrics' timestamps within 5 minutes in the same batch: https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020

cosmo0920 avatar Sep 25 '24 06:09 cosmo0920

what is the side effect of this for other endpoints/users ? is it ok to remove metrics for everybody ?

edsiper avatar Sep 26 '24 18:09 edsiper

As far as I have investigated, fluent-bit keeps resending indefinitely (until restarted) metrics from devices or mounts that no longer exist:

  | Sep 27, 2024 @ 10:37:02.140 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:50:15.946Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra6.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:48.274 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:49:17.062Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra5.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:41.445 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:44:55.162Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra2.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:32.213 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:40:47.164Z and is from series node_filesystem_device_error{device="tmpfs", fstype="tmpfs", host_name="petra1.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:18.366 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:40:47.164Z and is from series node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", host_name="petra1.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:17.153 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:50:15.946Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra6.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:03.301 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:48:17.259Z and is from series node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", host_name="petra4.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:35:53.855 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:51:37.82Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra7.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:35:48.239 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:49:17.062Z and is from series node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", host_name="petra5.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)

Trying to push metrics from 3 days ago... (tmpfs filesystem after user session) I don't think anyone can benefit from this.

Regards Rafał

ElectricWeasel avatar Sep 27 '24 09:09 ElectricWeasel

> Trying to push metrics from 3 days ago... (tmpfs filesystem after user session) I don't think anyone can benefit from this.
>
> Regards Rafał

Just to confirm: is this log from a build with this patch applied, or not?

cosmo0920 avatar Sep 27 '24 09:09 cosmo0920

> > Trying to push metrics from 3 days ago... (tmpfs filesystem after user session) I don't think anyone can benefit from this. Regards Rafał
>
> Just for confirming that this your log is applied this patch or not?

Ah sorry, it's a standard 3.1.2 version. I can try to compile from this branch and confirm.

Regards Rafał

ElectricWeasel avatar Sep 27 '24 09:09 ElectricWeasel

> what is the side effect of this for other endpoints/users ? is it ok to remove metrics for everybody ?

I added APIs to specify cutoff options. This should avoid breaking changes for users who are already using the otel encoding.
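
For readers following along, here is a rough sketch of how an opt-in cutoff could behave. It is not the code from this patch: `struct sample`, `struct cutoff_opts`, and `apply_cutoff` are hypothetical names made up for illustration and are not part of the cmetrics API. The point it shows is that with the option disabled (the default), nothing is dropped, so existing users of the encoder see no change.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; these are NOT cmetrics structures. */
struct sample {
    uint64_t timestamp_ns;   /* sample time, nanoseconds since the epoch */
    double   value;
};

struct cutoff_opts {
    int      enabled;        /* 0 keeps the old behavior: nothing is dropped */
    uint64_t max_age_ns;     /* samples older than now - max_age_ns are dropped */
};

/*
 * Compact the array in place, keeping only samples that pass the cutoff.
 * Returns the number of samples kept.
 */
static size_t apply_cutoff(struct sample *samples, size_t count,
                           uint64_t now_ns, const struct cutoff_opts *opts)
{
    size_t i;
    size_t kept = 0;
    uint64_t threshold;

    /* Cutoff disabled or not configured: leave every sample untouched. */
    if (opts == NULL || !opts->enabled) {
        return count;
    }

    /* Avoid unsigned underflow when max_age_ns exceeds now_ns. */
    threshold = (now_ns > opts->max_age_ns) ? now_ns - opts->max_age_ns : 0;

    for (i = 0; i < count; i++) {
        if (samples[i].timestamp_ns >= threshold) {
            samples[kept++] = samples[i];
        }
    }

    return kept;
}
```

An encoder would call something like this just before building the payload, with the threshold chosen by the output plugin's configuration. How the real options are named and wired into out_opentelemetry is up to the patch and its follow-up work.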

cosmo0920 avatar Sep 30 '24 06:09 cosmo0920

Is this being planned in for a release soon? Any other testing etc. that is needed?

Brodiemm avatar Oct 23 '24 23:10 Brodiemm

I believe so. But even once it is merged into the fluent-bit tree, more work is needed to implement the cutoff-related parameters in out_opentelemetry.

cosmo0920 avatar Oct 24 '24 01:10 cosmo0920