Log warning when failing to parse openmetrics response
What does this PR do?
Log a warning when the plain text prometheus metric parsing fails, including the line that it failed to parse. For example:
2024-05-15 14:58:11 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:131 in LogMessage) | openmetrics:c2c_prometheus:21fa576d88f0d1f8 | (mixins.py:490) | Failed to parse metric response 'histogram_bucket{le="0.01"💣} 0': Invalid metric name: histogram_bucket{le="0.01"💣}
Motivation
Currently the text_fd_to_metric_families just raises generic errors that don't provide enough context to know what it was that couldn't be parsed, which is problematic when deployed into environments where a single agent is parsing metrics from many different sources via autodiscovery.
Additional Notes
I'm very much not a Python developer, so let me know if what I've done is very unidiomatic. I considered wrapping the exception in my own ParserException, but from reading the other checks that didn't seem like a common thing to do. I also wasn't able to actually run the integration tests locally, because they seem to only work with Docker and not docker-compatible clients like nerdctl, but I'm assuming the CI pipeline will.
Review checklist (to be filled by reviewers)
- [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
- [ ] Changelog entries must be created for modifications to shipped code
- [ ] Add the
qa/skip-qalabel if the PR doesn't need to be tested during QA. - [ ] If you need to backport this PR to another branch, you can add the
backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 90.06%. Comparing base (
8480233) to head (4f1e3bc). Report is 5 commits behind head on master.
Additional details and impacted files
| Flag | Coverage Δ | |
|---|---|---|
| active_directory | 100.00% <ø> (+27.27%) |
:arrow_up: |
| activemq | 52.80% <ø> (ø) |
|
| activemq_xml | 82.31% <ø> (ø) |
|
| airflow | 92.20% <ø> (?) |
|
| amazon_msk | 88.91% <ø> (ø) |
|
| ambari | 85.80% <ø> (ø) |
|
| apache | 95.08% <ø> (ø) |
|
| arangodb | 98.23% <ø> (ø) |
|
| argo_rollouts | 90.00% <ø> (ø) |
|
| argo_workflows | 87.87% <ø> (ø) |
|
| argocd | 87.81% <ø> (ø) |
|
| aspdotnet | 100.00% <ø> (ø) |
|
| avi_vantage | 91.35% <ø> (ø) |
|
| azure_iot_edge | 82.08% <ø> (ø) |
|
| boundary | 100.00% <ø> (ø) |
|
| btrfs | 82.91% <ø> (ø) |
|
| cacti | 87.90% <ø> (ø) |
|
| calico | 84.61% <ø> (ø) |
|
| cassandra | 66.66% <ø> (ø) |
|
| cert_manager | 77.41% <ø> (ø) |
|
| cilium | 78.20% <ø> (?) |
|
| cisco_aci | 95.12% <ø> (+0.22%) |
:arrow_up: |
| citrix_hypervisor | 87.50% <ø> (ø) |
|
| cloud_foundry_api | 96.11% <ø> (ø) |
|
| cloudera | 99.51% <ø> (ø) |
|
| cockroachdb | 93.19% <ø> (ø) |
|
| consul | 91.82% <ø> (ø) |
|
| coredns | 94.61% <ø> (ø) |
|
| couch | 94.74% <ø> (+0.58%) |
:arrow_up: |
| crio | 89.79% <ø> (ø) |
|
| datadog_checks_base | 89.72% <ø> (+1.02%) |
:arrow_up: |
| datadog_checks_dev | 77.42% <ø> (+0.07%) |
:arrow_up: |
| datadog_checks_downloader | 81.37% <ø> (ø) |
|
| datadog_cluster_agent | 90.19% <ø> (ø) |
|
| dcgm | 92.10% <ø> (ø) |
|
| ddev | 88.01% <ø> (+0.68%) |
:arrow_up: |
| directory | 95.68% <ø> (+0.43%) |
:arrow_up: |
| disk | 89.34% <ø> (ø) |
|
| dns_check | 93.33% <ø> (ø) |
|
| druid | 97.70% <ø> (ø) |
|
| ecs_fargate | 83.52% <ø> (ø) |
|
| eks_fargate | 94.05% <ø> (ø) |
|
| envoy | 95.18% <ø> (+5.67%) |
:arrow_up: |
| esxi | 93.05% <ø> (ø) |
|
| etcd | 95.56% <ø> (ø) |
|
| external_dns | 89.28% <ø> (ø) |
|
| fluentd | 84.32% <ø> (ø) |
|
| fluxcd | 88.31% <ø> (ø) |
|
| foundationdb | 83.83% <ø> (ø) |
|
| gearmand | ? |
|
| gitlab_runner | 92.10% <ø> (ø) |
|
| go_expvar | 92.73% <ø> (ø) |
|
| gunicorn | 92.07% <ø> (-0.76%) |
:arrow_down: |
| harbor | ? |
|
| hazelcast | 92.39% <ø> (ø) |
|
| hdfs_datanode | 89.74% <ø> (ø) |
|
| hdfs_namenode | 86.72% <ø> (ø) |
|
| hive | 51.42% <ø> (ø) |
|
| hivemq | 61.90% <ø> (ø) |
|
| http_check | 95.32% <ø> (+2.02%) |
:arrow_up: |
| hudi | 73.91% <ø> (?) |
|
| ibm_ace | 92.25% <ø> (?) |
|
| ibm_db2 | 86.87% <ø> (ø) |
|
| ibm_i | 81.91% <ø> (ø) |
|
| ibm_mq | 91.14% <ø> (-0.14%) |
:arrow_down: |
| ibm_was | ? |
|
| ignite | 46.66% <ø> (ø) |
|
| impala | 97.97% <ø> (ø) |
|
| istio | 78.14% <ø> (+0.51%) |
:arrow_up: |
| jboss_wildfly | 47.36% <ø> (ø) |
|
| kafka | 64.70% <ø> (ø) |
|
| karpenter | 94.36% <ø> (ø) |
|
| kong | 87.62% <ø> (ø) |
|
| kube_apiserver_metrics | 97.74% <ø> (ø) |
|
| kube_controller_manager | 97.89% <ø> (ø) |
|
| kube_dns | 95.97% <ø> (ø) |
|
| kube_metrics_server | 94.87% <ø> (ø) |
|
| kube_proxy | 96.80% <ø> (ø) |
|
| kube_scheduler | 97.92% <ø> (ø) |
|
| kubelet | 91.01% <ø> (ø) |
|
| kubernetes_cluster_autoscaler | 93.22% <ø> (ø) |
|
| kubernetes_state | 89.50% <ø> (ø) |
|
| kyototycoon | 85.96% <ø> (ø) |
|
| kyverno | 81.81% <ø> (ø) |
|
| lighttpd | 83.64% <ø> (ø) |
|
| linkerd | 85.22% <ø> (+1.13%) |
:arrow_up: |
| linux_proc_extras | 96.22% <ø> (ø) |
|
| mapr | 82.42% <ø> (ø) |
|
| mapreduce | 82.08% <ø> (ø) |
|
| marathon | 83.12% <ø> (ø) |
|
| mcache | 93.50% <ø> (ø) |
|
| mesos_master | 89.81% <ø> (ø) |
|
| mesos_slave | 93.31% <ø> (ø) |
|
| nagios | 89.01% <ø> (ø) |
|
| network | 93.64% <ø> (+1.08%) |
:arrow_up: |
| nfsstat | 95.20% <ø> (ø) |
|
| nginx | 95.07% <ø> (+0.53%) |
:arrow_up: |
| nginx_ingress_controller | 98.36% <ø> (ø) |
|
| nvidia_triton | 88.52% <ø> (ø) |
|
| openldap | 96.33% <ø> (ø) |
|
| openmetrics | 98.08% <ø> (ø) |
|
| openstack | 55.19% <ø> (ø) |
|
| openstack_controller | 94.44% <ø> (?) |
|
| pgbouncer | 91.35% <ø> (ø) |
|
| php_fpm | 90.53% <ø> (+0.82%) |
:arrow_up: |
| postfix | 88.10% <ø> (ø) |
|
| postgres | ? |
|
| powerdns_recursor | 96.65% <ø> (ø) |
|
| presto | 59.09% <ø> (ø) |
|
| process | 85.61% <ø> (+0.27%) |
:arrow_up: |
| prometheus | 94.17% <ø> (ø) |
|
| proxysql | 98.97% <ø> (ø) |
|
| pulsar | 100.00% <ø> (ø) |
|
| rabbitmq | 95.37% <ø> (ø) |
|
| ray | 96.45% <ø> (ø) |
|
| redisdb | ? |
|
| rethinkdb | 97.93% <ø> (ø) |
|
| riak | 99.21% <ø> (ø) |
|
| riakcs | 87.71% <ø> (ø) |
|
| sap_hana | ? |
|
| silk | 93.82% <ø> (ø) |
|
| singlestore | 79.08% <ø> (-11.74%) |
:arrow_down: |
| snowflake | 96.27% <ø> (ø) |
|
| solr | 56.25% <ø> (ø) |
|
| spark | 93.87% <ø> (-0.28%) |
:arrow_down: |
| squid | 100.00% <ø> (ø) |
|
| statsd | 87.36% <ø> (ø) |
|
| strimzi | 89.78% <ø> (ø) |
|
| supervisord | 89.78% <ø> (ø) |
|
| system_core | 92.66% <ø> (ø) |
|
| system_swap | 98.30% <ø> (ø) |
|
| tcp_check | 90.39% <ø> (ø) |
|
| teamcity | 88.46% <ø> (+3.31%) |
:arrow_up: |
| tekton | 82.30% <ø> (ø) |
|
| teleport | 99.61% <ø> (ø) |
|
| temporal | 100.00% <ø> (ø) |
|
| teradata | 94.05% <ø> (ø) |
|
| tls | 92.02% <ø> (+0.86%) |
:arrow_up: |
| tokumx | 57.52% <ø> (ø) |
|
| tomcat | 60.41% <ø> (?) |
|
| torchserve | 97.32% <ø> (ø) |
|
| traefik_mesh | 76.75% <ø> (ø) |
|
| traffic_server | 96.13% <ø> (ø) |
|
| twemproxy | 79.56% <ø> (ø) |
|
| twistlock | 80.47% <ø> (ø) |
|
| varnish | 84.39% <ø> (+0.26%) |
:arrow_up: |
| vllm | 93.10% <ø> (ø) |
|
| voltdb | ? |
|
| vsphere | ? |
|
| weaviate | 76.27% <ø> (ø) |
|
| weblogic | 71.73% <ø> (?) |
|
| win32_event_log | 82.67% <ø> (+1.11%) |
:arrow_up: |
| wmi_check | 97.50% <ø> (ø) |
|
| yarn | 89.52% <ø> (ø) |
|
| zk | ? |
Flags with carried forward coverage won't be shown. Click here to find out more.
Test Results
1 046 files 1 046 suites 6h 38m 45s :stopwatch: 7 053 tests 6 976 :white_check_mark: 75 :zzz: 1 :x: 1 :fire: 32 084 runs 26 588 :white_check_mark: 5 494 :zzz: 1 :x: 1 :fire:
For more details on these failures and errors, see this check.
Results for commit 9663a006.
:recycle: This comment has been updated with latest results.
Let's bake the logging functionality into
text_fd_to_metric_families.
Are you sure? The reason I didn't do that in there, which would have been simpler to do, is because it's a straight up copy from the Python prometheus client, with some minor change to avoid a backwards incompatible change made in the upstream (I didn't look much into it). If I change more things in there it'll become harder to get rid of this technical debt and use the upstream directly.
I'm fine with making those changes, but I just want confirmation that you're aware of this before I do.
@Nevon good point! I still think we can proceed with the "baked in" approach:
- It's been 4 years since that code was committed. Not sure we're ever getting rid of that debt.
- Note how in the sketch I posted we don't modify the parsing logic but rather wrap it? This is on purpose, I too don't want to mix the book-keeping with the parsing. We can put a nice fat comment next to or inside the
_parsefunction to make it explicit that we don't want to touch this parsing logic.
Went ahead and implemented the changes we talked about. Thanks for teaching me about tee. As a non-python developer I'm not particularly familiar with the standard library, and my googling didn't lead me to any better way to get a reference to the current element of an iterator.
@Nevon sorry this disappeared in the black void of GH notifications. I'm on it and will get back to you within a week (most likely sooner, but you know how it is with promises...)
Meanwhile I'm updating your branch by merging in master. all you should have to do is pull to sync up.
No worries. I'm on vacation until the beginning of September, so no rush on my end. I'm glad to see that you're on it.