Cyclictest utility reports latency increase when telegraf is running with intel-powerstat plugin on preempt-rt kernel
Relevant telegraf.conf
[agent]
collection_jitter = "0s"
debug = false
flush_interval = "10s"
flush_jitter = "0s"
hostname = "$HOSTNAME"
interval = "10s"
logfile = ""
metric_batch_size = 1000
metric_buffer_limit = 10000
omit_hostname = false
precision = ""
quiet = false
round_interval = true
[[outputs.prometheus_client]]
listen = ":9273"
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
report_active = false
[[inputs.linux_cpu]]
metrics = ["cpufreq"]
interval = "30s"
[[inputs.ipmi_sensor]]
interval = "30s"
timeout = "20s"
metric_version = 1
use_sudo = false
[[inputs.intel_powerstat]]
cpu_metrics = ["cpu_frequency", "cpu_busy_frequency", "cpu_temperature", "cpu_c0_state_residency","cpu_c1_state_residency", "cpu_c6_state_residency", "cpu_busy_cycles"]
package_metrics = ["current_power_consumption", "current_dram_power_consumption" ,"thermal_design_power", "cpu_base_frequency" ,"uncore_frequency"]
[[inputs.mem]]
Logs from Telegraf
This issue does not show any relevant information on logs
System info
Telegraf 1.27.3, Debian 11, linux-yocto-rt kernel 5.10 with upstream patches
Docker
No response
Steps to reproduce
- Use a preempt-rt kernel on a linux distribution
- Install rt-tests package from your linux distribution (or compile it from source at https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git)
- With the system on steady state, execute cyclictest utility from a long period of time (at least 6 hrs) and save latency output data for reference.
- Start telegraf without intel-powerstat plugin enabled
- Execute cyclictest utility and save latency output data
- Start telegraf with intel-powerstat plugin enabled
- Execute cyclictest utility and save latency output data
Expected behavior
Latencies from telegraf execution with and without intel-powerstat plugin should not differ so much
Actual behavior
There is a huge difference on latencies when telegraf is running with intel-powerstat plugin
Additional info
Further investigation indicated this occurs because of the concurrent read of MSR with goroutines.
Hi,
There is a huge difference on latencies between steady state and telegraf execution
Can you quantify what this actually looks like in practice?
Latencies from steady state system and telegraf execution should not differ so much
You are comparing an idle system to one that is actively getting metrics about the system every 10 seconds. I would expect there to be differences. As is, I am not clear on what the actionable item is with this issue.
@zak-pawel any additional thoughts?
Hi @powersj, I missed to inform another test case, that is telegraf without intel-powerstat plugin, sorry. I will edit the issue to make it clear.
If you compare telegraf with and without the plugin, it is possible to notice a huge increase on latency.
I already performed an investigation, and the issue occurs on the concurrent read of MSR files using go routines. Pull request #13829 fix it.
@alysondeives,
Thanks for the details and more importantly the PR!
I'll have the Intel folks review it as well. I would be curious to know if they had a reason to grab all the metrics with go routines versus doing it one at a time. Additionally, it would be good to know if there is any impact to the validity of the metrics to ensure we are not changing any behavior for end users. I assume the answer is no, but I'd like to get an answer from them as well.
Thanks again
Hello @alysondeives, thanks for the issue and PR.
There is a particular reason to grab all the metrics with go routines: we want to read values from MSRs (per particular core) as closely to each other as possible. Why? In order for our calculations to be as precise as possible. And we are closest to this situation precisely when various counters are read from MSRs as closely to each other as possible.
As you can see in code:
- for
CPU Busy Frequencymetric we need to read values frommaximumFrequencyClockCountLocation,actualFrequencyClockCountLocationandtimestampCounterLocationto do proper calculations. - for
CPU C0 State residencymetric we need to read values frommaximumFrequencyClockCountLocationandtimestampCounterLocationto do proper calculations. - for
CPU C1 State Residencymetric we need to read values frommaximumFrequencyClockCountLocation,c3StateResidencyLocation,c6StateResidencyLocation,c7StateResidencyLocationandtimestampCounterLocationto do proper calculations. - for
CPU C6 State Residencymetric we need to read values fromc6StateResidencyLocationandtimestampCounterLocationto do proper calculations.
Given all the above, I'm ok with reading MSRs sequentially as long as:
- it will depend on config parameter, for example:
read_methodwith possible values:concurrentorsequential - default will be:
concurrent(to have calculations as precise as possible by default and also for backward compatibility) - it will be very well described in
README.md,sample.confand code to show user pros and cons of each option- please search for all occurrences of
read_methodin https://github.com/influxdata/telegraf/blob/master/plugins/inputs/smart/README.md to see how it can be done
- please search for all occurrences of
We have been working on additional features for this plugin (changes will be upstreamed in few weeks), and one of them will be core/cpu mask in config which will allow user to exclude particular cores from MSR reads. Thanks to that user will be able to exclude isolated cores on which sensitive, lowlatency workload is run from "expensive MSR reads".
There is a huge difference on latencies when telegraf is running with intel-powerstat plugin
@alysondeives
I would love to see both results.
Also, can you please share:
- Model(s) of CPU on which you performed these tests
- OS:
uname -a - Output from
lscpu - Kernel parameters:
cat /proc/cmdline - Did you disable
irqbalance.servicebefore running benchmarks?- Can you paste output from
systemctl list-unit-files | grep -e "enabled" -e "UNIT FILE"?
- Can you paste output from
- Did you run
cyclicteston isolated cores? - Did you tune your environment properly for low latency testing?
The reason I'm asking about all of this is that I've been investigating the impact of various factors on low latency workloads. From my various tests, it turned out that the Go language itself (even a very simple "hello world" program, designed to be as single-threaded as possible with GC turned off) is causing interrupts on isolated cores. Meanwhile, a "similar" application written in Python doesn't do this.
I've submitted results of this investigation to different communities, you can check following issue/questions if you want:
- https://github.com/golang/go/issues/60735
- https://bugs.launchpad.net/ubuntu/+source/linux-lowlatency/+bug/2023391
- https://stackoverflow.com/questions/76460740/how-to-get-rid-of-interrupts-on-isolated-cores-caused-by-simple-go-app-during
@alysondeives,
Were you able to collect the information above? Can you also re-test on the latest version of Telegraf please?
Thanks!
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you!