integrations-core
prometheus_client/parser.py unhelpful stacktrace
Describe what happened:
The Datadog agent with the Prometheus integration complains about an invalid metric format, but the exception it reports is not helpful:
2022-06-02 17:50:01 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:68 in Error) | check:openmetrics | Error running check: [{"message": "could not convert string to float: '17:49:50'", "traceback": "Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 128, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[label_end + 2:])
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 113, in _parse_value_and_timestamp
    return float(s), None
ValueError: could not convert string to float: ''
During handling of the above exception, another exception occurred:
From this exception we can see the offending value (17:49:50), but we don't know which metric or even which target triggered it.
Describe what you expected:
Useful debugging information would be attached to the stack trace, helping us find which metric and which service is causing this.
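For illustration, something like the following sketch (my own example, not existing agent or prometheus_client code; the helper name and the `target` argument are made up) would pin down exactly which sample line and which target are at fault:

```python
# Hypothetical helper, not part of the agent or prometheus_client: scan a
# scraped exposition payload and report sample lines whose value does not
# parse as a float, together with the target the payload came from.
def report_malformed_samples(payload: str, target: str):
    bad = []
    for lineno, line in enumerate(payload.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and # HELP / # TYPE comments
        # A sample line looks like: name{labels} value [timestamp]
        # so the value is the first token after the optional label block.
        rest = stripped.rsplit('}', 1)[-1] if '}' in stripped else stripped.split(None, 1)[-1]
        tokens = rest.split()
        try:
            float(tokens[0])
        except (IndexError, ValueError):
            bad.append((lineno, line))
            print(f"target={target} line {lineno}: cannot parse value in {line!r}")
    return bad
```

Running that against the payload behind the log above would print the line containing `17:49:50` along with the target, which is exactly the information missing from the current error.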
Steps to reproduce the issue:
???
Additional environment details (Operating System, Cloud provider, etc):
Running agents on k8s.
Based on the stack trace, I'm also worried that the uncaught exception bubbled up to the top level, causing the entire collection to fail instead of skipping just the one malformed metric:
Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 1071, in run
    self.check(instance)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/base.py", line 64, in check
    scraper.scrape()
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/scraper.py", line 238, in scrape
    for metric in self.consume_metrics(runtime_data):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/scraper.py", line 258, in consume_metrics
    for metric in metric_parser:
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/scraper.py", line 276, in parse_metrics
    for metric in self.parse_metric_families(line_streamer):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 217, in text_fd_to_metric_families
    sample = _parse_sample(line)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 140, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[name_end:])
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 114, in _parse_value_and_timestamp
    value = float(values[0])
ValueError: could not convert string to float: '17:49:50'
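To make that concern concrete, here is a rough sketch (my own illustration under assumptions, not the agent's actual behavior) of how a scrape could drop only the malformed sample instead of aborting: pre-filter lines whose value field does not parse before handing the payload to prometheus_client's parser, and log what was dropped.

```python
import io
import logging

# text_fd_to_metric_families is the real prometheus_client entry point seen
# in the traceback; everything else here is a hypothetical wrapper.
from prometheus_client.parser import text_fd_to_metric_families

log = logging.getLogger(__name__)


def parse_families_skipping_bad_lines(payload: str, target: str):
    kept = []
    for line in payload.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            kept.append(line)
            continue
        # Extract the value token after the optional {labels} block.
        rest = stripped.rsplit('}', 1)[-1] if '}' in stripped else stripped.split(None, 1)[-1]
        tokens = rest.split()
        try:
            float(tokens[0])
        except (IndexError, ValueError):
            # Skip and log only the one malformed sample; the rest of the
            # scrape continues.
            log.warning("Skipping malformed sample from %s: %r", target, line)
            continue
        kept.append(line)
    yield from text_fd_to_metric_families(io.StringIO('\n'.join(kept)))
```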
Seeing lots of these spamming our Datadog agent pods' logs as well (deployed in EKS as a DaemonSet):
2022-09-27 14:35:01 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:68 in Error) | check:openmetrics | Error running check: [{"message": "could not convert string to float: '14:34:50'", "traceback": "Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 128, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[label_end + 2:])
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 113, in _parse_value_and_timestamp
    return float(s), None
ValueError: could not convert string to float: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 1006, in run
    self.check(instance)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/base_check.py", line 135, in check
    self.process(scraper_config)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 533, in process
    for metric in self.scrape_metrics(scraper_config):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 504, in scrape_metrics
    for metric in self.parse_metric_family(response, scraper_config):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 428, in parse_metric_family
    for metric in text_fd_to_metric_families(input_gen):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/libs/prometheus.py", line 76, in text_fd_to_metric_families
    sample = _parse_sample(line)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 140, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[name_end:])
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py", line 114, in _parse_value_and_timestamp
    value = float(values[0])
ValueError: could not convert string to float: '14:34:50'
"}]
Hey @janrous-rai, sorry for the delayed response. I've created an internal ticket to work on a fix for the issues you've reported.
In the meantime, if you want someone from Datadog to look closer at your specific case, feel free to open a support ticket.
It would be interesting to see whether the response from the metrics endpoint conforms to the Prometheus data model.
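For anyone who wants to check that, a quick sketch below fetches a target's /metrics response and runs it through the same parser the check uses; the URL is a placeholder and `requests` is assumed to be available:

```python
import requests  # assumed available in your environment
from prometheus_client.parser import text_string_to_metric_families

target_url = "http://localhost:8080/metrics"  # placeholder: the scraped target

resp = requests.get(target_url, timeout=10)
try:
    # Same parsing library the agent uses; raises ValueError on bad payloads.
    families = list(text_string_to_metric_families(resp.text))
    print(f"OK: parsed {len(families)} metric families")
except ValueError as exc:
    # The same ValueError the agent logs, but now tied to a known target.
    print(f"{target_url} does not conform to the exposition format: {exc}")
```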
Echoing this sentiment. Running the Datadog agent as part of a central platform that thousands of applications deploy to, I get hundreds of errors similar to this every minute:
2024-05-03 10:00:47 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:70 in Error) | check:openmetrics | Error running check: [{"message": "substring not found", "traceback": "Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/prometheus_client/parser.py", line 121, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[label_end + 1:])
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/prometheus_client/parser.py", line 106, in _parse_value_and_timestamp
    return float(s), None
ValueError: could not convert string to float: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/base.py", line 1235, in run
    self.check(instance)
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/base_check.py", line 142, in check
    self.process(scraper_config)
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 581, in process
    for metric in self.scrape_metrics(scraper_config):
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 542, in scrape_metrics
    for metric in self.parse_metric_family(response, scraper_config):
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 466, in parse_metric_family
    for metric in text_fd_to_metric_families(input_gen):
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/libs/prometheus.py", line 76, in text_fd_to_metric_families
    sample = _parse_sample(line)
  File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/prometheus_client/parser.py", line 130, in _parse_sample
    name_end = text.index(separator)
ValueError: substring not found
"}]
Without any information about where the openmetrics response that it tried to parse came from, or even what the metric line it tried to parse was, it's absolutely impossible to track down the actual cause of the error.
In #17514 I am at least logging the line that failed to parse. I didn't include any information about the target, as I wanted to keep the change minimal and I wasn't sure how to get that information from within the parser, but it has still been extremely helpful to me to even just see the metric name. I've also opened a Datadog support ticket about this.
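For context, the kind of change I mean is roughly the following (a hypothetical sketch of the idea, not the actual diff in #17514): re-raise the parser's ValueError with the offending line attached, so the metric name at least shows up in the agent log even when the target is unknown.

```python
# Hypothetical sketch of the idea (not the actual diff): wrap the value
# parsing so the failing line is part of the error message.
def parse_value_with_context(value_text: str, raw_line: str) -> float:
    try:
        return float(value_text)
    except ValueError as exc:
        raise ValueError(f"could not parse sample line {raw_line!r}: {exc}") from exc
```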