integrations-core icon indicating copy to clipboard operation
integrations-core copied to clipboard

prometheus_client/parser.py unhelpful stacktrace

Open janrous-rai opened this issue 2 years ago • 6 comments

Describe what happened:

Datadog agent with prometheus integration complains about metric invalid format but the exception returned is not helpful:

2022-06-02 17:50:01 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:68 in Error) | check:openmetrics | Error running check: [{"message": "could not convert string to float: '17:49:50'", "traceback": "Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 128, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[label_end + 2:])
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 113, in _parse_value_and_timestamp
    return float(s), None
ValueError: could not convert string to float: ''

During handling of the above exception, another exception occurred:

From this exception we can see the metric value (17:49:50) but we don't know what metric or even what target triggered this.

Describe what you expected:

Useful debugging information would be attached to the stack trace that could help us find which metric and which service is causing this.

Steps to reproduce the issue:

???

Additional environment details (Operating System, Cloud provider, etc):

Running agents on k8s.

janrous-rai avatar Jun 02 '22 17:06 janrous-rai

Based on the stacktrace I'm also worried that the uncaught exception bubbled up to the top level causing the entire collection to fail instead of skipping just the one malformed metric:

Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1071, in run
    self.check(instance)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/base.py\", line 64, in check
    scraper.scrape()
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/scraper.py\", line 238, in scrape
    for metric in self.consume_metrics(runtime_data):
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/scraper.py\", line 258, in consume_metrics
    for metric in metric_parser:
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/v2/scraper.py\", line 276, in parse_metrics
    for metric in self.parse_metric_families(line_streamer):
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 217, in text_fd_to_metric_families
    sample = _parse_sample(line)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 140, in _parse_sample
    value, timestamp = _parse_value_and_timestamp(text[name_end:])
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 114, in _parse_value_and_timestamp
    value = float(values[0])
ValueError: could not convert string to float: '17:49:50'

janrous-rai avatar Jun 02 '22 17:06 janrous-rai

Seeing lots of these spamming as well in our Datadog agent pods logs (deployed in EKS as a DaemonSet):

 2022-09-27 14:35:01 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:68 in Error) | check:openmetrics | Error running check: [{"message": "could not convert string to float: '14:34:50'", "traceback": "Traceba │
│ ck (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 128, in _parse_sample\n    value, timestamp = _parse_value_and_timestamp(text[label_end  │
│ + 2:])\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parser.py\", line 113, in _parse_value_and_timestamp\n    return float(s), None\nValueError: could not convert string to float: │
│  ''\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", l │
│ ine 1006, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/base_check.py\", line 135, in check\n    self.process(scraper_config)\ │
│ n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 533, in process\n    for metric in self.scrape_metrics(scraper_config):\n  File \"/opt/datadog- │
│ agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 504, in scrape_metrics\n    for metric in self.parse_metric_family(response, scraper_config):\n  File \"/opt/datadog-a │
│ gent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 428, in parse_metric_family\n    for metric in text_fd_to_metric_families(input_gen):\n  File \"/opt/datadog-agent/embe │
│ dded/lib/python3.8/site-packages/datadog_checks/base/checks/libs/prometheus.py\", line 76, in text_fd_to_metric_families\n    sample = _parse_sample(line)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packag │
│ es/prometheus_client/parser.py\", line 140, in _parse_sample\n    value, timestamp = _parse_value_and_timestamp(text[name_end:])\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/prometheus_client/parse │
│ r.py\", line 114, in _parse_value_and_timestamp\n    value = float(values[0])\nValueError: could not convert string to float: '14:34:50'\n"}]

gino-techops avatar Sep 27 '22 14:09 gino-techops

Hey @janrous-rai, sorry for the delayed response. I've created an internal ticket to work on a fix for the issues you've reported.

In the meantime, if you want someone from Datadog to look closer at your specific case, feel free to open a support ticket.

alopezz avatar Oct 17 '22 13:10 alopezz

It would be interesting to see the response from the metrics endpoint if it conformed to Prometheus data model

ian28223 avatar Feb 27 '23 02:02 ian28223

Echoing this sentiment. Running the Datadog agent as part of a central platform that thousands of applications deploy to, I get hundreds of errors similar to this every minute:

2024-05-03 10:00:47 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:70 in Error) | check:openmetrics | Error running check: [{"message": "substring not found", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/prometheus_client/parser.py\", line 121, in _parse_sample\n    value, timestamp = _parse_value_and_timestamp(text[label_end + 1:])\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/prometheus_client/parser.py\", line 106, in _parse_value_and_timestamp\n    return float(s), None\nValueError: could not convert string to float: ''\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/base.py\", line 1235, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/base_check.py\", line 142, in check\n    self.process(scraper_config)\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 581, in process\n    for metric in self.scrape_metrics(scraper_config):\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 542, in scrape_metrics\n    for metric in self.parse_metric_family(response, scraper_config):\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 466, in parse_metric_family\n    for metric in text_fd_to_metric_families(input_gen):\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/libs/prometheus.py\", line 76, in text_fd_to_metric_families\n    sample = _parse_sample(line)\n  File \"/opt/datadog-agent/embedded/lib/python3.9/site-packages/prometheus_client/parser.py\", line 130, in _parse_sample\n    name_end = text.index(separator)\nValueError: substring not found\n"}]

Without any information about where the openmetrics response that it tried to parse came from, or even what the metric line it tried to parse was, it's absolutely impossible to track down the actual cause of the error.

Nevon avatar May 03 '24 10:05 Nevon

In #17514 I am at least logging the line that was failed to parse. I didn't include any information about the target, as I wanted to keep the change minimal and I wasn't sure how to get that information from within the parser, but it's still been extremely helpful to me to even just see the metric name. I've also opened a Datadog support ticket about this.

Nevon avatar May 17 '24 08:05 Nevon