prometheus-cpp
prometheus-cpp pull endpoint incomplete in Kubernetes
Hi - I'm running prometheus-cpp v0.12.3 with gcc 8.2.0. My app works fine on my Unix machine, but when I run it under Kubernetes the web endpoint doesn't get written completely. Here's an example:
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 7568
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 37
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count
I also see the same partial write of my stats if I curl the metrics endpoint from within the k8s pod.
Here's my pull endpoint when running outside k8s:
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 572
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 1
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 1
exposer_request_latencies_sum 4124
exposer_request_latencies{quantile="0.5"} 4124
exposer_request_latencies{quantile="0.9"} 4124
exposer_request_latencies{quantile="0.99"} 4124
# HELP total_ticks total ticks processed since start
# TYPE total_ticks counter
total_ticks{partition="2"} 9020
total_ticks{partition="3"} 9205
total_ticks{partition="1"} 8045
total_ticks{partition="0"} 8454
# HELP active_symbols active symbols in memory since flush
# TYPE active_symbols gauge
active_symbols{aaa="bbbb",ccc="ddd",partition="3"} 403
active_symbols{aaa="bbbb",ccc="ddd",partition="1"} 436
active_symbols{aaa="bbbb",ccc="ddd",partition="2"} 413
active_symbols{aaa="bbbb",ccc="ddd",partition="0"} 411
# HELP unique_symbols_written unique symbol count since start
# TYPE unique_symbols_written gauge
unique_symbols_written{partition="2"} 1662
unique_symbols_written{partition="3"} 1641
unique_symbols_written{partition="1"} 1661
unique_symbols_written{partition="0"} 1663
# HELP overall_message_latency end to end message latency
# TYPE overall_message_latency gauge
overall_message_latency{partition="2"} 611226227622370
overall_message_latency{partition="3"} 609943666622385
overall_message_latency{partition="1"} 607062113622370
overall_message_latency{partition="0"} 605046497622355
# HELP input_message_latency arctic native processing time
# TYPE input_message_latency gauge
input_message_latency{partition="2"} 0
input_message_latency{partition="3"} 0
input_message_latency{partition="1"} 0
input_message_latency{partition="0"} 999985
Have you seen anything like this?
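For context, this is roughly how such an endpoint is wired up with prometheus-cpp; the reporter's actual code isn't shown, so the bind address, metric names, and labels below are assumptions that simply mirror the output above. The exposer_* metrics in the dump are added by the Exposer itself.

```cpp
#include <memory>

#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/gauge.h>
#include <prometheus/registry.h>

int main() {
  // Serve /metrics via the built-in civetweb server (port taken from the
  // curl command later in this thread).
  prometheus::Exposer exposer{"0.0.0.0:9091"};

  auto registry = std::make_shared<prometheus::Registry>();

  auto& ticks_family = prometheus::BuildCounter()
                           .Name("total_ticks")
                           .Help("total ticks processed since start")
                           .Register(*registry);
  auto& ticks_p0 = ticks_family.Add({{"partition", "0"}});

  auto& symbols_family = prometheus::BuildGauge()
                             .Name("unique_symbols_written")
                             .Help("unique symbol count since start")
                             .Register(*registry);
  auto& symbols_p0 = symbols_family.Add({{"partition", "0"}});

  exposer.RegisterCollectable(registry);

  // The application loop would then update the metrics, e.g.:
  ticks_p0.Increment();
  symbols_p0.Set(1663);
}
```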
Hello,
I've never experienced anything similar. I'd like to narrow the error down to either the prometheus-cpp serialization or the civetweb HTTP stack.
I guess a good clue would be the Content-Length header of the HTTP response. If it matches the short content, the error is within the serialization. If it is larger than the truncated content, the error is within the HTTP stack.
Would you be able to collect a tcpdump of such a truncated HTTP request/response? You could also share it privately with me if necessary.
Thanks, Gregor
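For reference, a minimal sketch of that Content-Length check in C++ using libcurl (an assumption here; plain curl -i works just as well): it fetches /metrics and compares the advertised Content-Length with the number of body bytes actually received.

```cpp
#include <cstdio>
#include <string>

#include <curl/curl.h>

// Append each chunk of the response body to a std::string.
static size_t CollectBody(char* data, size_t size, size_t nmemb, void* userp) {
  static_cast<std::string*>(userp)->append(data, size * nmemb);
  return size * nmemb;
}

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();

  std::string body;
  curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:9091/metrics");
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, CollectBody);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

  CURLcode res = curl_easy_perform(curl);

  // Content-Length as announced by the server (-1 if the header was missing).
  curl_off_t announced = -1;
  curl_easy_getinfo(curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T, &announced);

  std::printf("curl result: %s\nContent-Length: %lld\nbody bytes: %zu\n",
              curl_easy_strerror(res), static_cast<long long>(announced),
              body.size());

  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return 0;
}
```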
Another thought: if you run the code in k8s, do you use any ingress/egress proxies or services like fluentd between the service and the HTTP client?
Hi Gregor - thanks for the quick response. That's a good idea - I'll get that info for you. Duncan
Hi Gregor - I've just dumped the metrics page from a couple of my k8s services, and the header byte count doesn't match the length of the response. In both cases the actual text length is greater than the Content-Length in the header. I've attached two examples. The command I ran was "curl -i url:9091/metrics 2> /dev/null". scrape2.txt scrape1.txt
The content of the HTTP response is exactly 451 bytes long:
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 30765
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 140
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count
So somehow the serialization stopped right at the latencies summary.
Did you scrape directly on the pod or from the outside? If from the outside, could you try on the pod, right next to the application, to rule out any interference by services in between?
It seems to abort right in between those two lines: https://github.com/jupp0r/prometheus-cpp/blob/342de5e93bd0cbafde77ec801f9dd35a03bceb3f/core/src/text_serializer.cc#L102-L103
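For orientation, a simplified sketch of the shape of that spot (not a verbatim copy of text_serializer.cc): the truncated body ends right after the `_count` suffix, so the write of the metric name completes while the write of the sample count never shows up.

```cpp
// Simplified illustration, not the actual prometheus-cpp source: the summary
// serializer first emits the metric name with its "_count" suffix and then
// the numeric sample count. The truncated response contains the former but
// not the latter.
#include <cstdint>
#include <ostream>
#include <string>

void SerializeSummaryCount(std::ostream& out, const std::string& family_name,
                           std::uint64_t sample_count) {
  out << family_name << "_count";  // present in the truncated output
  out << " " << sample_count;      // missing in the truncated output
  out << "\n";
}
```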
Hi Gregor - we tried scraping the endpoint from outside the k8s pod, and we also ran the curl command from inside the container.
Hi Gregor - the curl command I ran was from a bash shell inside the container. As far as I understand it, that's as close as I can get to the process.
I'm running out of ideas about what could go wrong. Could you please try version 0.13.0? There we simplified the serialization of doubles.
If that does not help, you'll have to attach a debugger to the process to see where the control flow goes wrong. Or use rr to record the control flow and debug on your machine. Or sprinkle the prometheus-cpp code with printfs.
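As an illustration of the printf idea, a hypothetical sketch (the real serializer code differs): tracing the stream state after each write makes the first failing write visible on stderr.

```cpp
// Hypothetical tracing helper, not actual prometheus-cpp code: wrap a write
// to the output stream and log the stream flags afterwards.
#include <cstdio>
#include <ostream>

template <typename T>
std::ostream& TracedWrite(std::ostream& out, const T& value, const char* where) {
  out << value;
  std::fprintf(stderr, "%s: good=%d eof=%d fail=%d bad=%d\n", where,
               out.good(), out.eof(), out.fail(), out.bad());
  return out;
}

// Usage inside the serializer would look roughly like:
//   TracedWrite(out, sample_count, "summary count");
```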
Fair enough - I'll have a poke around and let you know how we get on. Thanks for your help. Duncan
Hello,
did you find the culprit?
Thanks, Gregor
Hi Gregor - not yet; the Christmas holidays haven't helped. We need to fix this, so I'll let you know of any progress.
Duncan
Were you successful with bug-hunting?
Hi Duncan,
could you please tell me (roughly) what the problem was?
Thanks, Gregor