integrations-core icon indicating copy to clipboard operation
integrations-core copied to clipboard

Log warning when failing to parse openmetrics response

Open Nevon opened this issue 1 year ago • 5 comments

What does this PR do?

Log a warning when the plain text prometheus metric parsing fails, including the line that it failed to parse. For example:

2024-05-15 14:58:11 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:131 in LogMessage) | openmetrics:c2c_prometheus:21fa576d88f0d1f8 | (mixins.py:490) | Failed to parse metric response 'histogram_bucket{le="0.01"💣} 0': Invalid metric name: histogram_bucket{le="0.01"💣}

Motivation

This issue

Currently the text_fd_to_metric_families just raises generic errors that don't provide enough context to know what it was that couldn't be parsed, which is problematic when deployed into environments where a single agent is parsing metrics from many different sources via autodiscovery.

Additional Notes

I'm very much not a Python developer, so let me know if what I've done is very unidiomatic. I considered wrapping the exception in my own ParserException, but from reading the other checks that didn't seem like a common thing to do. I also wasn't able to actually run the integration tests locally, because they seem to only work with Docker and not docker-compatible clients like nerdctl, but I'm assuming the CI pipeline will.

Review checklist (to be filled by reviewers)

  • [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • [ ] Changelog entries must be created for modifications to shipped code
  • [ ] Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • [ ] If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Nevon avatar May 03 '24 14:05 Nevon

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.06%. Comparing base (8480233) to head (4f1e3bc). Report is 5 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
active_directory 100.00% <ø> (+27.27%) :arrow_up:
activemq 52.80% <ø> (ø)
activemq_xml 82.31% <ø> (ø)
airflow 92.20% <ø> (?)
amazon_msk 88.91% <ø> (ø)
ambari 85.80% <ø> (ø)
apache 95.08% <ø> (ø)
arangodb 98.23% <ø> (ø)
argo_rollouts 90.00% <ø> (ø)
argo_workflows 87.87% <ø> (ø)
argocd 87.81% <ø> (ø)
aspdotnet 100.00% <ø> (ø)
avi_vantage 91.35% <ø> (ø)
azure_iot_edge 82.08% <ø> (ø)
boundary 100.00% <ø> (ø)
btrfs 82.91% <ø> (ø)
cacti 87.90% <ø> (ø)
calico 84.61% <ø> (ø)
cassandra 66.66% <ø> (ø)
cert_manager 77.41% <ø> (ø)
cilium 78.20% <ø> (?)
cisco_aci 95.12% <ø> (+0.22%) :arrow_up:
citrix_hypervisor 87.50% <ø> (ø)
cloud_foundry_api 96.11% <ø> (ø)
cloudera 99.51% <ø> (ø)
cockroachdb 93.19% <ø> (ø)
consul 91.82% <ø> (ø)
coredns 94.61% <ø> (ø)
couch 94.74% <ø> (+0.58%) :arrow_up:
crio 89.79% <ø> (ø)
datadog_checks_base 89.72% <ø> (+1.02%) :arrow_up:
datadog_checks_dev 77.42% <ø> (+0.07%) :arrow_up:
datadog_checks_downloader 81.37% <ø> (ø)
datadog_cluster_agent 90.19% <ø> (ø)
dcgm 92.10% <ø> (ø)
ddev 88.01% <ø> (+0.68%) :arrow_up:
directory 95.68% <ø> (+0.43%) :arrow_up:
disk 89.34% <ø> (ø)
dns_check 93.33% <ø> (ø)
druid 97.70% <ø> (ø)
ecs_fargate 83.52% <ø> (ø)
eks_fargate 94.05% <ø> (ø)
envoy 95.18% <ø> (+5.67%) :arrow_up:
esxi 93.05% <ø> (ø)
etcd 95.56% <ø> (ø)
external_dns 89.28% <ø> (ø)
fluentd 84.32% <ø> (ø)
fluxcd 88.31% <ø> (ø)
foundationdb 83.83% <ø> (ø)
gearmand ?
gitlab_runner 92.10% <ø> (ø)
go_expvar 92.73% <ø> (ø)
gunicorn 92.07% <ø> (-0.76%) :arrow_down:
harbor ?
hazelcast 92.39% <ø> (ø)
hdfs_datanode 89.74% <ø> (ø)
hdfs_namenode 86.72% <ø> (ø)
hive 51.42% <ø> (ø)
hivemq 61.90% <ø> (ø)
http_check 95.32% <ø> (+2.02%) :arrow_up:
hudi 73.91% <ø> (?)
ibm_ace 92.25% <ø> (?)
ibm_db2 86.87% <ø> (ø)
ibm_i 81.91% <ø> (ø)
ibm_mq 91.14% <ø> (-0.14%) :arrow_down:
ibm_was ?
ignite 46.66% <ø> (ø)
impala 97.97% <ø> (ø)
istio 78.14% <ø> (+0.51%) :arrow_up:
jboss_wildfly 47.36% <ø> (ø)
kafka 64.70% <ø> (ø)
karpenter 94.36% <ø> (ø)
kong 87.62% <ø> (ø)
kube_apiserver_metrics 97.74% <ø> (ø)
kube_controller_manager 97.89% <ø> (ø)
kube_dns 95.97% <ø> (ø)
kube_metrics_server 94.87% <ø> (ø)
kube_proxy 96.80% <ø> (ø)
kube_scheduler 97.92% <ø> (ø)
kubelet 91.01% <ø> (ø)
kubernetes_cluster_autoscaler 93.22% <ø> (ø)
kubernetes_state 89.50% <ø> (ø)
kyototycoon 85.96% <ø> (ø)
kyverno 81.81% <ø> (ø)
lighttpd 83.64% <ø> (ø)
linkerd 85.22% <ø> (+1.13%) :arrow_up:
linux_proc_extras 96.22% <ø> (ø)
mapr 82.42% <ø> (ø)
mapreduce 82.08% <ø> (ø)
marathon 83.12% <ø> (ø)
mcache 93.50% <ø> (ø)
mesos_master 89.81% <ø> (ø)
mesos_slave 93.31% <ø> (ø)
nagios 89.01% <ø> (ø)
network 93.64% <ø> (+1.08%) :arrow_up:
nfsstat 95.20% <ø> (ø)
nginx 95.07% <ø> (+0.53%) :arrow_up:
nginx_ingress_controller 98.36% <ø> (ø)
nvidia_triton 88.52% <ø> (ø)
openldap 96.33% <ø> (ø)
openmetrics 98.08% <ø> (ø)
openstack 55.19% <ø> (ø)
openstack_controller 94.44% <ø> (?)
pgbouncer 91.35% <ø> (ø)
php_fpm 90.53% <ø> (+0.82%) :arrow_up:
postfix 88.10% <ø> (ø)
postgres ?
powerdns_recursor 96.65% <ø> (ø)
presto 59.09% <ø> (ø)
process 85.61% <ø> (+0.27%) :arrow_up:
prometheus 94.17% <ø> (ø)
proxysql 98.97% <ø> (ø)
pulsar 100.00% <ø> (ø)
rabbitmq 95.37% <ø> (ø)
ray 96.45% <ø> (ø)
redisdb ?
rethinkdb 97.93% <ø> (ø)
riak 99.21% <ø> (ø)
riakcs 87.71% <ø> (ø)
sap_hana ?
silk 93.82% <ø> (ø)
singlestore 79.08% <ø> (-11.74%) :arrow_down:
snowflake 96.27% <ø> (ø)
solr 56.25% <ø> (ø)
spark 93.87% <ø> (-0.28%) :arrow_down:
squid 100.00% <ø> (ø)
statsd 87.36% <ø> (ø)
strimzi 89.78% <ø> (ø)
supervisord 89.78% <ø> (ø)
system_core 92.66% <ø> (ø)
system_swap 98.30% <ø> (ø)
tcp_check 90.39% <ø> (ø)
teamcity 88.46% <ø> (+3.31%) :arrow_up:
tekton 82.30% <ø> (ø)
teleport 99.61% <ø> (ø)
temporal 100.00% <ø> (ø)
teradata 94.05% <ø> (ø)
tls 92.02% <ø> (+0.86%) :arrow_up:
tokumx 57.52% <ø> (ø)
tomcat 60.41% <ø> (?)
torchserve 97.32% <ø> (ø)
traefik_mesh 76.75% <ø> (ø)
traffic_server 96.13% <ø> (ø)
twemproxy 79.56% <ø> (ø)
twistlock 80.47% <ø> (ø)
varnish 84.39% <ø> (+0.26%) :arrow_up:
vllm 93.10% <ø> (ø)
voltdb ?
vsphere ?
weaviate 76.27% <ø> (ø)
weblogic 71.73% <ø> (?)
win32_event_log 82.67% <ø> (+1.11%) :arrow_up:
wmi_check 97.50% <ø> (ø)
yarn 89.52% <ø> (ø)
zk ?

Flags with carried forward coverage won't be shown. Click here to find out more.

codecov[bot] avatar May 03 '24 14:05 codecov[bot]

Test Results

 1 046 files   1 046 suites   6h 38m 45s :stopwatch:  7 053 tests  6 976 :white_check_mark:    75 :zzz: 1 :x: 1 :fire: 32 084 runs  26 588 :white_check_mark: 5 494 :zzz: 1 :x: 1 :fire:

For more details on these failures and errors, see this check.

Results for commit 9663a006.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar May 03 '24 15:05 github-actions[bot]

Let's bake the logging functionality into text_fd_to_metric_families.

Are you sure? The reason I didn't do that in there, which would have been simpler to do, is because it's a straight up copy from the Python prometheus client, with some minor change to avoid a backwards incompatible change made in the upstream (I didn't look much into it). If I change more things in there it'll become harder to get rid of this technical debt and use the upstream directly.

I'm fine with making those changes, but I just want confirmation that you're aware of this before I do.

Nevon avatar May 17 '24 16:05 Nevon

@Nevon good point! I still think we can proceed with the "baked in" approach:

  1. It's been 4 years since that code was committed. Not sure we're ever getting rid of that debt.
  2. Note how in the sketch I posted we don't modify the parsing logic but rather wrap it? This is on purpose, I too don't want to mix the book-keeping with the parsing. We can put a nice fat comment next to or inside the _parse function to make it explicit that we don't want to touch this parsing logic.

iliakur avatar May 20 '24 09:05 iliakur

Went ahead and implemented the changes we talked about. Thanks for teaching me about tee. As a non-python developer I'm not particularly familiar with the standard library, and my googling didn't lead me to any better way to get a reference to the current element of an iterator.

Nevon avatar May 23 '24 14:05 Nevon

@Nevon sorry this disappeared in the black void of GH notifications. I'm on it and will get back to you within a week (most likely sooner, but you know how it is with promises...)

Meanwhile I'm updating your branch by merging in master. all you should have to do is pull to sync up.

iliakur avatar Jul 25 '24 15:07 iliakur

No worries. I'm on vacation until the beginning of September, so no rush on my end. I'm glad to see that you're on it.

Nevon avatar Jul 25 '24 15:07 Nevon