opentelemetry-collector-contrib

[hostmetricsreceiver] Add important per-process counters

Open dgcom opened this issue 3 years ago • 9 comments

Is your feature request related to a problem? Please describe.

There are several important per-process metrics that the host metrics receiver does not yet collect, for example:

  • process thread count
  • process open handles count
  • process open file descriptor count

These can be considered golden metrics for a process. They are needed for most troubleshooting and trend analysis, for example to confirm that a process is not leaking threads or handles.

Describe the solution you'd like

Collect and emit at least the metrics mentioned above. Ideally, also collect more per-process metrics (optionally), including those collected by the leading infrastructure monitoring tools on the market.

Describe alternatives you've considered

I have analyzed the per-process metrics collected by competing tools such as New Relic and Datadog. Their infrastructure agents can collect these metrics, but I would like to use the OpenTelemetry Collector as a unified agent instead.

Additional context

There is a (great) trend of switching to the OpenTelemetry host metrics receiver for infrastructure monitoring (e.g. SigNoz, Splunk Observability, New Relic). If such tools all use the same host metrics receiver, they will all miss these important and useful metrics, making troubleshooting and observability much harder.

dgcom, Jul 15 '22 04:07

Pinging code owners: @dmitryax

github-actions[bot], Jul 15 '22 17:07

@dgcom is this something you plan to work on? If so I will assign the issue to you.

TylerHelmuth, Jul 15 '22 17:07

@dgcom is this something you plan to work on? If so I will assign the issue to you.

I would love to, but I currently don't have enough time or Go skills to contribute...

dgcom, Jul 15 '22 17:07

@TylerHelmuth I can take this one.

evan-bradley, Jul 15 '22 17:07

@evan-bradley it's yours.

TylerHelmuth, Jul 15 '22 17:07

process open handles count

@dgcom Just to clarify, are you talking about the Windows concept of a process handle? If so, I do not believe the library that the hostmetricsreceiver uses to gather process data currently supports getting this information.

The other metrics can be easily scraped. I will be adding voluntary and involuntary context switch counts and an open file descriptor count.

evan-bradley, Aug 03 '22 19:08

On Windows, the handle count is the "\Process(*)\Handle Count" perfmon counter. In PowerShell it can be read like this:

# All processes
Get-Counter "\Process(*)\Handle Count"
# Specific process
Get-Counter "\Process(explorer)\Handle Count"
# List all available counters for processes
(Get-Counter -ListSet Process).Paths

For thread count, it is "\Process(*)\Thread Count". Windows does not have a file descriptor counter, so that metric should be available only on Linux.

Looking at the library used by the receiver, leoluk/perflib_exporter (a perflib-based Prometheus exporter for Windows and low-level Go perflib library), I don't see a reason why it wouldn't be able to retrieve the available counters...

dgcom, Aug 03 '22 20:08

Thank you for the clarification. Most process metrics are generated using data obtained from gopsutil, which is the library I was referring to that doesn't yet support getting a process handle count.

It does look like perflib_exporter should be able to retrieve this information. I have limited working knowledge around Windows and do not have a Windows environment readily available to test with, so someone else will have to implement that metric within the hostmetricsreceiver.

evan-bradley, Aug 04 '22 14:08

I looked at gopsutil and it does not use performance counters at all, which explains why it only supports CPU, memory and a limited number of IO counters. The best option would be to change the process scraper implementation to use perflib_exporter, which provides more per-process data, and that data is compatible with many other Windows monitoring implementations. And I know it is hard to write such low-level cross-platform implementations...

dgcom, Aug 04 '22 16:08

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • receiver/hostmetrics: @dmitryax

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot], Nov 10 '22 03:11

What a coincidence - I was actually checking the changes in the hostmetrics receiver when the bot posted the 60-day notice...

I see that process.open_file_descriptors and process.threads are now available (see opentelemetry-collector-contrib/documentation.md at main in open-telemetry/opentelemetry-collector-contrib).

But process handles seem to be missing...

dgcom, Nov 10 '22 04:11

@dgcom I wasn't able to add process handles as part of my work, as I don't have a Windows environment to test with. I will leave this issue available for someone else to pick that up.

evan-bradley, Nov 10 '22 14:11

@evan-bradley OK, that's fine, thank you for covering the Linux side of things! I'll see if I can finally find some time to dig into this myself by the end of the year... unless someone else is kind enough to pick this up before then.

dgcom, Nov 10 '22 18:11

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot], Jan 10 '23 03:01

This issue has been inactive for 60 days.

I strongly believe that we should keep this open until it is fully resolved.

dgcom, Jan 11 '23 17:01

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot], Apr 10 '23 03:04

This issue is still relevant and should be kept open until it is resolved.

dgcom, Apr 10 '23 13:04

I would like handle metrics for Windows as well. I have issue #21379 open for this, along with PR #22813, which adds support for a Windows-exclusive process.handles metric. It doesn't use the performance counter but instead uses NtQuerySystemInformation. I chose this approach because it adds only one new syscall per scrape, but perhaps the performance counter would be preferred for simplicity.

braydonk, May 26 '23 19:05

I ended up changing the PR to use a WMI query instead, which turned out to be the simplest way to do it. The PR is still waiting on a review at this stage.

braydonk, Jun 21 '23 14:06

This is great news! I hope we can close this out once the PR is merged...

dgcom, Jun 21 '23 23:06

The new process.handles metric is in v0.81!

braydonk, Jul 05 '23 15:07

The new process.handles metric is in v0.81!

Great, now I need to test it out!

dgcom, Jul 05 '23 17:07

I tested 0.81 and I can see thread and handle counts on Windows - this is great!

This issue can be closed now.

dgcom, Jul 07 '23 05:07