semantic-conventions icon indicating copy to clipboard operation
semantic-conventions copied to clipboard

Add initial experimental .NET CLR runtime metrics

Open stevejgordon opened this issue 1 year ago • 37 comments

Fixes #956

Changes

Adds proposed experimental .NET CLR runtime metrics to the semantic conventions. Based on discussions with the .NET runtime team, the implementation plan will be to port the existing metrics from OpenTelemetry.Instrumentation.Runtime as directly as possible into the runtime itself. The names have been modified to align with the runtime environment metrics conventions.

Merge requirement checklist

stevejgordon avatar May 14 '24 10:05 stevejgordon

I'm not entirely sure why markdownlint fails on the autogenerated tables.

stevejgordon avatar May 14 '24 10:05 stevejgordon

Have you run both attribute-registry-generation and table-generation?

trisch-me avatar May 14 '24 14:05 trisch-me

Have you run both attribute-registry-generation and table-generation?

I thought I had done so as the contents were updated, but I can try those again. I'm having to hack around the tooling a bit to get things running on Windows.

stevejgordon avatar May 14 '24 15:05 stevejgordon

@trisch-me I've updated the formatting per your suggestion and rerun both of those make targets. Neither made any changes to the markdown files.

stevejgordon avatar May 14 '24 15:05 stevejgordon

Alternatively you could wait until #1000 will be merged, it has fix for this bug

trisch-me avatar May 15 '24 11:05 trisch-me

@stevejgordon The #1000 is merged so please re-run code generation locally and update your files. Thanks

trisch-me avatar May 15 '24 12:05 trisch-me

Thanks, @trisch-me. Looks good!

stevejgordon avatar May 15 '24 13:05 stevejgordon

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: stevejgordon / name: Steve Gordon (54833c468b85891622406d22108eff404ad8adbe, cdbef6a50a6442c53dff59ec42281ef5e60934be, 630ee50a381b41c38566c4eb86f1b015d4edcffe, 1a274040ac202430bb6477de7179c762b027495a, a4ad8952b6db4ab52974340c571a4d5080bcadb1, 6b0a6e942d5cfd962a1d49e1e75c1e3c21829ab6, 1aefc2939fc234939055c446af7151916920e2ec, dc63af3cb86e01462e5766a3165572f46a15aadc, a5da35ca68ae9c8d022ad2d55b74cb863f9e541a, 5d63a8f536b903c01554d6b5248809afb6495acf, c80a739bf579b754bba8e84b3c4c4a14bd89730c, 8717a4007e9781af7fcff748b53c138894d5f001, 3d14868285b7ed7efd72f288293ad966fab78103, b4b98c08d3ec10cd4f86ac837088373dc894c40a, d08792a53295d387a6bedfc4f428f0bbf4927180, aa4eb335e094965f7d2ee7b54a7a91225222ea08, c48510fc7a10b3d0d15011224a30914f5ceb79d8, 4ae73ed3fec0f1b53dae188141acd68e945f4ba6, 88fd58c4985f5ea83964a93a6369d2ad2afa7fc3, 72c16570cddd5daf0c455d7afae9ff9fe5dc4da4, fe47a3123adb5b04154c5d3eb070bb5211a67e1c, 90a9356bbc8dea9cfc23285578f8b9cadfe8c98f, f8866a1e446d0201c033e6b973bdeafc2b430f66, 866a85cc68b249e0efc4f945cd9cd2dfeba2537f, 8f95fd10110aea0ec9b7c13bd32ad6ea20f3b282, c6b3628e47ff8ef67ab1d3f46ca021b22e5a0de3, 6b7a0ed35e88a047b85e727b252a0a267ad68de6, 8cc317588b1f10be8429d6c524d09f1ca0002772, 8e90ea0647703cb54f79b8a669c14ae906b32992, f1dd3dd8a3f0e84e9137994fb14ec3439d17aa5e, eab22b20ab57c0ca0bfffaebcaf2f08b15eabca3, 085e33196ed1351ba5bd2de9598f3228dad3386e, fbd75f25d692155fc707796e5e3b01c069a7cf6e)
  • :white_check_mark: login: lmolkova / name: Liudmila Molkova (d104216ab578424673a4ee4b1a6acaaf199a236b, 47ed007e1b29ba92c02b1b893d9ad61b14c682fc, 6dcfd10525f3bc316e4a5b7b3cbb90455127e9fe)

/easycla

trask avatar May 28 '24 18:05 trask

Looking at the discussions in this PR, I want to reiterate the proposal https://github.com/open-telemetry/semantic-conventions/pull/1035#pullrequestreview-2077532984:

Let's think about user experience - which metrics users want to see first - my assumption is CPU, memory (from all sources, ideally in one/few metrics, GC, maybe threads. Let try to come up with a few metrics that would not require users to have a deep prior expertise in .NET memory management or know subtle differences between different .NET flavors.

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things. Let's focus on basics first.

lmolkova avatar Jun 03 '24 18:06 lmolkova

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things. Let's focus on basics first.

Are you suggesting lets review some things first and some things later? Or do you mean "Lets see if we can avoid including some of these metrics in .NET 9 at all?" If its just review ordering I'd say no problem. If the goal is to have these metrics not contain all the metrics in the current OTel .NET runtime instrumentation I think that leads to trouble. One of the major scenarios will be people migrating from using older metrics to these metrics and every metric we remove makes that migration harder or discourages them from migrating at all. If we want to have a simple set of metrics for folks just getting started I think a good approach to that will be docs or a pre-made dashboard, not by excluding metrics from the underlying instrumentation.

EDIT: Just to add I know a few of the metrics may feel a bit advanced or niche, but they are there because customer feedback wanted them to be there. We took a bunch of things out during the move from Windows Performance Counters -> EventCounters, but customers gave us feedback that we had cut too deep and please restore some of the metrics that were important to them.

noahfalk avatar Jun 03 '24 23:06 noahfalk

on reducing the scope:

@noahfalk thanks for the explanation. If it's either all or nothing, I believe we should still start from the basics (and try to do everything in .NET 9).

The main concerns on the basics from my side are:

  • CPU - https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1626774736 - seems trivial, just needs to be defined.
  • Memory: https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1613860662 and https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1626780003

The rest seems more cosmetic based on our discussions on duration/histograms in the https://github.com/dotnet/runtime/issues/85372#issuecomment-2138800897

lmolkova avatar Jun 05 '24 01:06 lmolkova

The main concerns on the basics from my side are...

Thanks @lmolkova, much appreciated! I'm just waiting to make sure I've got the right idea about the cpu portion of the discussion and then I'll respond back on the memory part as well.

noahfalk avatar Jun 07 '24 06:06 noahfalk

@noahfalk / @lmolkova - I've returned to the office and updated some core changes discussed above.

  • clr.* is now dotnet.*
  • filenames are updated to align with the clr > dotnet change
  • I've added a dotnet.exception.type attribute to the registry
  • I've reintroduced details in dotnet-metrics.md about the equivalent API used to collect the metric value.

@trisch-me One issue, if we stick with the updated file naming, is docs/attributes-registry/dotnet.md where ideally, the title would be .NET rather than Dotnet. Is there a way to configure that?

stevejgordon avatar Jun 18 '24 11:06 stevejgordon

@stevejgordon not in current implementation but as soon as this issue is implemented, which should be soon, you will be able to add to the dotnet group a title as .Net

trisch-me avatar Jun 18 '24 14:06 trisch-me

@noahfalk / @lmolkova - I've returned to the office and updated some core changes discussed above.

Thanks and welcome back! I think there were two other areas where change has been proposed and not yet captured in updated docs:

  1. Add a dotnet.cpu.time metric to provide CPU usage info: https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1634044798
  2. Add either a dotnet.heap.used or dotnet.memory.used: https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1634044832 This memory one seemed the roughest at the moment. As best I can tell dotnet.memory.used is a super-set of dotnet.heap.used so perhaps that is the only one we'd need? Presumably adding either of these metrics would effectively replace/rename data that is currently exposed in dotnet.gc.committed_memory.size and dotnet.gc.heap.size. I don't think it was a replacement for any of the other gc size-related metrics.

@lmolkova - I'm still interested to get your feedback on those memory related metrics trying to refine what would be there :)

Thanks!

noahfalk avatar Jun 19 '24 10:06 noahfalk

  1. Add a dotnet.cpu.time metric to provide CPU usage info: Add initial experimental .NET CLR runtime metrics #1035 (comment)

Yep, I still need to add dotnet.cpu.time and dotnet.cpu.count. The first depends on #1026 being merged, so I've waited for now.

stevejgordon avatar Jun 21 '24 11:06 stevejgordon

Thanks @stevejgordon I think we're getting pretty close!

Regarding CPU - the https://github.com/open-telemetry/semantic-conventions/pull/1026 is merged. But it'd be totally fine to send CPU as a separate PR.

lmolkova avatar Jun 28 '24 02:06 lmolkova

No problem, @lmolkova. Thanks again for the feedback. I've updated per your suggestions and also snuck in the new CPU metrics.

stevejgordon avatar Jun 28 '24 09:06 stevejgordon

@lmolkova I'm thinking dotnet.thread_pool.queue.length unit should be {work_item} given what the queue length represents. Would you concur?

stevejgordon avatar Jul 01 '24 15:07 stevejgordon

@lmolkova I'm thinking dotnet.thread_pool.queue.length unit should be {work_item} given what the queue length represents. Would you concur?

👍

lmolkova avatar Jul 01 '24 17:07 lmolkova

I've opened a PR for the implementation at https://github.com/dotnet/runtime/pull/104680. It's a draft for now while we finalise the last naming discussion. We'd need to get this approved quite soon to meet the .NET 9 deadline if any changes are needed here.

stevejgordon avatar Jul 10 '24 15:07 stevejgordon

I'm declaring comment bankruptcy and will try to list active discussions here:

  • [x] codeowners (trivial) - https://github.com/open-telemetry/semantic-conventions/pull/1035/files#r1672592867
  • [x] exceptions.count -> exceptions (trivial) https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672597984
  • [x] heap.size -> last_collection_size (some discussion) https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672029675
  • [x] memory overall (big discussion) https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672057241
    • [x] remove gc.heap.objects.size - https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672006570

(might add more if I see them)

lmolkova avatar Jul 10 '24 17:07 lmolkova

[Update] based on the offline discussion image

lmolkova avatar Jul 10 '24 18:07 lmolkova

Discussed offline with @noahfalk @tarekgh @samsp-msft @reyang

Suggesting the following changes:

  • heap renames (https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672029675)
    • [x] rename dotnet.gc.memory.total_allocated -> dotnet.gc.heap.total_allocated
    • [x] rename dotnet.gc.memory.committed -> dotnet.gc.last_collection.memory.committed_size
    • [x] rename dotnet.gc.heap.size -> dotnet.gc.last_collection.heap.size
    • [x] rename dotnet.gc.heap.fragmentation -> dotnet.gc.last_collection.heap.fragmentation.size
  • [x] remove dotnet.gc.heap.objects.size (https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672006570)
  • [x] add dotnet.process.memory.working_set (https://github.com/open-telemetry/semantic-conventions/pull/1035#discussion_r1672072345)
  • cpu
    • [x] rename dotnet.cpu.time -> dotnet.process.cpu.time
    • [x] rename dotnet.cpu.count -> dotnet.process.cpu.count
      • [follow up in semconv and/or in .NET docs, not blocking this PR] we'll need to document it well since it's not perfect
        • how docker/k8s limits are translated (rounding up)
        • is it dynamic?
        • describe how to use (e.g. document how to calculate cpu utilization from time and count)
  • [x] please also document meter name(s)

lmolkova avatar Jul 10 '24 21:07 lmolkova

Suggesting the following changes:

Thanks! I'm also checking with Maoni offline to ensure she doesn't think anything in our new naming is problematic, but given past discussions with her I think she is going to be happy.

noahfalk avatar Jul 10 '24 22:07 noahfalk

Thanks! I'm also checking with Maoni offline to ensure she doesn't think anything in our new naming is problematic, but given past discussions with her I think she is going to be happy.

Yep, all good with the names as proposed.

noahfalk avatar Jul 11 '24 06:07 noahfalk

Thanks, @lmolkova. I've pushed up all the suggested changes.

stevejgordon avatar Jul 11 '24 07:07 stevejgordon

Given that we have an implementation ready to go, should we consider adding these as stable?

stevejgordon avatar Jul 11 '24 07:07 stevejgordon

Given that we have an implementation ready to go, should we consider adding these as stable?

Let's add them as experimental for now, we can mark them as stable when .NET 9 is GA-ed.

lmolkova avatar Jul 11 '24 22:07 lmolkova