opentelemetry-collector-contrib
[hostmetricsreceiver] Add important per-process counters
Is your feature request related to a problem? Please describe. Several very important per-process metrics are not yet collected by the host metrics receiver, for example:
- process thread count
- process open handles count
- process open file descriptor count
These can be considered process golden metrics. They are needed for most troubleshooting and trend analysis, for example to confirm that a process is not leaking threads or handles.
Describe the solution you'd like Collect and emit at least the metrics mentioned above. The ideal solution - (optionally) collect more per-process metrics, including those collected by the leading infrastructure monitoring tools on the market.
Describe alternatives you've considered I have analyzed the per-process metrics collected by competing tools such as New Relic and Datadog. Their infrastructure agents are able to collect these metrics, but I would like to use the OTEL collector as a unified agent instead.
Additional context There is a (great) trend to switch to the OTEL host metrics receiver for infrastructure monitoring (e.g. Signoz, Splunk Observability, New Relic etc.). If such tools all use the same host metrics receiver, they will all miss these very important and useful metrics, making troubleshooting and observability much harder.
Pinging code owners: @dmitryax
@dgcom is this something you plan to work on? If so I will assign the issue to you.
I would love to, but I don't have enough time and skills in Go currently to contribute...
@TylerHelmuth I can take this one.
@evan-bradley it's yours.
process open handles count
@dgcom Just to clarify, are you talking about the Windows concept of a process handle? If so, I do not believe the library that the hostmetricsreceiver uses to gather process data currently supports getting this information.
The other metrics can be easily scraped. I will be adding voluntary and involuntary context switch counts and an open file descriptor count.
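For readers following along, here is a minimal sketch of where these numbers can be read on Linux. It is only an illustration in Python, not the receiver's implementation (the receiver is written in Go and gets most process data from gopsutil). It assumes the usual Linux `/proc/<pid>/status` field names, and the function names are hypothetical.

```python
import os

# Fields of /proc/<pid>/status that map to the counters discussed above.
# The right-hand names are illustrative, not the receiver's metric names.
STATUS_FIELDS = {
    "Threads": "threads",
    "voluntary_ctxt_switches": "voluntary_ctxt_switches",
    "nonvoluntary_ctxt_switches": "nonvoluntary_ctxt_switches",
}


def parse_status_counters(text):
    """Extract the counters of interest from the text of /proc/<pid>/status."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in STATUS_FIELDS:
            counters[STATUS_FIELDS[key]] = int(value.strip())
    return counters


def read_process_counters(pid="self"):
    """Read thread, context switch and open fd counts for one process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        counters = parse_status_counters(f.read())
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    counters["open_file_descriptors"] = len(os.listdir(f"/proc/{pid}/fd"))
    return counters
```

On a Linux host, `read_process_counters()` returns the counters for the calling process. Reading another process's `fd` directory may require elevated permissions.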
For Windows, handles count is "\Process(*)\Handle Count" perfmon counter. In PowerShell this is available with this example:
# All processes
get-counter "\Process(*)\Handle Count"
# Specific process
get-counter "\Process(explorer)\Handle Count"
# List all available counters for processes
(Get-Counter -ListSet Process).Paths
For thread count, it is "\Process(*)\Thread Count". Windows does not have a file descriptor counter, so that metric should be available only on Linux.
Looking at the library used by the receiver - leoluk/perflib_exporter: perflib-based Prometheus exporter for Windows and low-level Go perflib library - I don't see a reason why it wouldn't be able to retrieve available counters...
Thank you for the clarification. Most process metrics are generated using data obtained from gopsutil, which is the library I was referring to that doesn't yet support getting a process handle count.
It does look like perflib_exporter should be able to retrieve this information. I have limited working knowledge around Windows and do not have a Windows environment readily available to test with, so someone else will have to implement that metric within the hostmetricsreceiver.
I looked at gopsutil and it does not use performance counters at all, which explains why it only supports CPU, memory and a limited number of IO counters. The best option would be to change the process scraper implementation to use perflib_exporter, which provides more per-process data, and that data is compatible with many other Windows monitoring implementations. And I know it is hard to write such low-level cross-platform implementations...
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
- receiver/hostmetrics: @dmitryax
See Adding Labels via Comments if you do not have permissions to add labels yourself.
What a coincidence - I was actually checking changes in the hostmetrics receiver when the bot posted the 60-day notice...
I see that process.open_file_descriptors and process.threads are now available:
opentelemetry-collector-contrib/documentation.md at main · open-telemetry/opentelemetry-collector-contrib
But process handles seem to be missing...
@dgcom I wasn't able to add process handles as part of my work, as I don't have a Windows environment to test with. I will leave this issue available for someone else to pick that up.
@evan-bradley Ok, that's fine, thank you for covering the Linux side of things! I'll see if I finally get some time to dig into this myself by the end of the year... Unless someone else is kind enough to pick this up before that.
This issue has been inactive for 60 days.
I strongly believe that we should keep this open until it is fully resolved.
This issue is still relevant and should be kept open until it is resolved.
I would like handle metrics for Windows as well. I have issue #21379 open for this along with PR #22813 that adds support for a Windows exclusive process.handles metric. It doesn't use the performance counter but instead uses NtQuerySystemInformation. This solution does result in only one new syscall per-scrape which is why I chose that, but perhaps the performance counter would be preferred for simplicity.
I ended up changing the PR to use a WMI query instead and it ended up being the simplest way to do it. The PR is still waiting on a review at this stage.
This is great news! Hope we'll close this out once PR is merged...
The new process.handles metric is in v0.81!
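For anyone who wants to try it, a configuration sketch follows. It assumes these metrics are optional and disabled by default, so they have to be enabled explicitly under the process scraper; check the receiver's documentation.md for your collector version before relying on it. The collection interval is an arbitrary example value.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s   # example value, adjust as needed
    scrapers:
      process:
        metrics:
          process.threads:
            enabled: true
          process.open_file_descriptors:   # Linux side of the request
            enabled: true
          process.handles:                 # Windows only
            enabled: true
```

The receiver still has to be referenced from a metrics pipeline for anything to be exported.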
Great, now need to test it out!
I tested v0.81 and I can see thread and handle counts on Windows - this is great!
This issue can be closed now.