telegraf Cyclictest utility reports latency increase when telegraf is running with intel-powerstat plugin on preempt-rt kernel

Relevant telegraf.conf

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "10s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "10s"
  logfile = ""
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true


[[outputs.prometheus_client]]
  listen = ":9273"


[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.linux_cpu]]
  metrics = ["cpufreq"]
  interval = "30s"
[[inputs.ipmi_sensor]]
  interval = "30s"
  timeout = "20s"
  metric_version = 1
  use_sudo = false
[[inputs.intel_powerstat]]
  cpu_metrics = ["cpu_frequency", "cpu_busy_frequency", "cpu_temperature", "cpu_c0_state_residency","cpu_c1_state_residency", "cpu_c6_state_residency", "cpu_busy_cycles"]
  package_metrics = ["current_power_consumption", "current_dram_power_consumption" ,"thermal_design_power", "cpu_base_frequency" ,"uncore_frequency"]
[[inputs.mem]]

Logs from Telegraf

This issue does not show any relevant information on logs

System info

Telegraf 1.27.3, Debian 11, linux-yocto-rt kernel 5.10 with upstream patches

Docker

No response

Steps to reproduce

Use a preempt-rt kernel on a linux distribution
Install rt-tests package from your linux distribution (or compile it from source at https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git)
With the system on steady state, execute cyclictest utility from a long period of time (at least 6 hrs) and save latency output data for reference.
Start telegraf without intel-powerstat plugin enabled
Execute cyclictest utility and save latency output data
Start telegraf with intel-powerstat plugin enabled
Execute cyclictest utility and save latency output data

Expected behavior

Latencies from telegraf execution with and without intel-powerstat plugin should not differ so much

Actual behavior

There is a huge difference on latencies when telegraf is running with intel-powerstat plugin

Additional info

Further investigation indicated this occurs because of the concurrent read of MSR with goroutines.

Aug 24 '23 14:08 alysondeives

Hi,

There is a huge difference on latencies between steady state and telegraf execution

Can you quantify what this actually looks like in practice?

Latencies from steady state system and telegraf execution should not differ so much

You are comparing an idle system to one that is actively getting metrics about the system every 10 seconds. I would expect there to be differences. As is, I am not clear on what the actionable item is with this issue.

@zak-pawel any additional thoughts?

Aug 24 '23 15:08 powersj

Hi @powersj, I missed to inform another test case, that is telegraf without intel-powerstat plugin, sorry. I will edit the issue to make it clear.

If you compare telegraf with and without the plugin, it is possible to notice a huge increase on latency.

I already performed an investigation, and the issue occurs on the concurrent read of MSR files using go routines. Pull request #13829 fix it.

Aug 24 '23 19:08 alysondeives

@alysondeives,

Thanks for the details and more importantly the PR!

I'll have the Intel folks review it as well. I would be curious to know if they had a reason to grab all the metrics with go routines versus doing it one at a time. Additionally, it would be good to know if there is any impact to the validity of the metrics to ensure we are not changing any behavior for end users. I assume the answer is no, but I'd like to get an answer from them as well.

Thanks again

Aug 24 '23 19:08 powersj

Hello @alysondeives, thanks for the issue and PR.

There is a particular reason to grab all the metrics with go routines: we want to read values from MSRs (per particular core) as closely to each other as possible. Why? In order for our calculations to be as precise as possible. And we are closest to this situation precisely when various counters are read from MSRs as closely to each other as possible.

As you can see in code:

for CPU Busy Frequency metric we need to read values from maximumFrequencyClockCountLocation, actualFrequencyClockCountLocation and timestampCounterLocation to do proper calculations.
for CPU C0 State residency metric we need to read values from maximumFrequencyClockCountLocation and timestampCounterLocation to do proper calculations.
for CPU C1 State Residency metric we need to read values from maximumFrequencyClockCountLocation, c3StateResidencyLocation, c6StateResidencyLocation, c7StateResidencyLocation and timestampCounterLocation to do proper calculations.
for CPU C6 State Residency metric we need to read values from c6StateResidencyLocation and timestampCounterLocation to do proper calculations.

Given all the above, I'm ok with reading MSRs sequentially as long as:

it will depend on config parameter, for example: read_method with possible values: concurrent or sequential
default will be: concurrent (to have calculations as precise as possible by default and also for backward compatibility)
it will be very well described in README.md, sample.conf and code to show user pros and cons of each option
- please search for all occurrences of read_method in https://github.com/influxdata/telegraf/blob/master/plugins/inputs/smart/README.md to see how it can be done

We have been working on additional features for this plugin (changes will be upstreamed in few weeks), and one of them will be core/cpu mask in config which will allow user to exclude particular cores from MSR reads. Thanks to that user will be able to exclude isolated cores on which sensitive, lowlatency workload is run from "expensive MSR reads".

Aug 24 '23 20:08 p-zak

There is a huge difference on latencies when telegraf is running with intel-powerstat plugin

@alysondeives

I would love to see both results.

Also, can you please share:

Model(s) of CPU on which you performed these tests
OS: uname -a
Output from lscpu
Kernel parameters: cat /proc/cmdline
Did you disable irqbalance.service before running benchmarks?
- Can you paste output from systemctl list-unit-files | grep -e "enabled" -e "UNIT FILE" ?
Did you run cyclictest on isolated cores?
Did you tune your environment properly for low latency testing?

The reason I'm asking about all of this is that I've been investigating the impact of various factors on low latency workloads. From my various tests, it turned out that the Go language itself (even a very simple "hello world" program, designed to be as single-threaded as possible with GC turned off) is causing interrupts on isolated cores. Meanwhile, a "similar" application written in Python doesn't do this.

I've submitted results of this investigation to different communities, you can check following issue/questions if you want:

https://github.com/golang/go/issues/60735
https://bugs.launchpad.net/ubuntu/+source/linux-lowlatency/+bug/2023391
https://stackoverflow.com/questions/76460740/how-to-get-rid-of-interrupts-on-isolated-cores-caused-by-simple-go-app-during

Aug 24 '23 20:08 p-zak

@alysondeives,

Were you able to collect the information above? Can you also re-test on the latest version of Telegraf please?

Thanks!

May 07 '24 15:05 powersj

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you!

May 21 '24 18:05 telegraf-tiger[bot]