opentelemetry-python-contrib icon indicating copy to clipboard operation
opentelemetry-python-contrib copied to clipboard

Fix instrument types in system metrics

Open bjrara opened this issue 1 year ago • 2 comments

Description

This PR fixes the instrument types when creating the system metrics.

Metrics that are impacted:

  • system.network.connections: Collected from psutil.net_connections is a snapshot of the current state of network connections, thus should be delta and is not monotonic.
  • process.runtime.cpython.gc_count Collected from python gc is the current collection counts, thus should be delta and is not monotonic.
  • process.runtime.cpython.memory Collected from psutil.Process.memory_info is a snapshot of the process’s memory usage at the time of the call, thus should be delta and is not monotonic.
  • process.runtime.cpython.thread_count Collected from psutil.Process.num_threads is the number of threads currently used by this process (non cumulative).
  • process.open_file_descriptor.count Collected from psutil.Process.num_fds is the number of file descriptors currently opened by this process (non cumulative).

Change the instrument types for the metrics above to observable_gauge for correction.

Fixes https://github.com/open-telemetry/opentelemetry-python-contrib/issues/2861

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Tested locally using console exporter configured with delta temporality for all the metric types.

Does This PR Require a Core Repo Change?

  • [ ] No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • [ ] Followed the style guidelines of this project
  • [ ] Changelogs have been updated
  • [ ] Unit tests have been added
  • [ ] Documentation has been updated

bjrara avatar Sep 11 '24 23:09 bjrara

Don't have much clue about metrics but the generated semantic conventions package provides some helper functions to creates these metrics, and they return UpDownCounter, e.g. create_process_open_file_descriptor_count

xrmx avatar Sep 12 '24 08:09 xrmx

@xrmx @emdneto Thanks for taking a look at the PR.

The issue is not with UpDownCounter, but the temporality used by ObservableUpDownCounter.

From the definition, ObservableUpDownCounter is cumulative, while these metrics are delta.

if isinstance(instrument, ObservableUpDownCounter):
        return _SumAggregation(
            attributes,
            instrument_is_monotonic=False,
            instrument_aggregation_temporality=(
                AggregationTemporality.CUMULATIVE
            ),
            start_time_unix_nano=start_time_unix_nano,
        )

Changing them from ObservableUpDownCounter to UpDownCounter should also fix the issue. ~~I agree with both of you that using UpDownCounter is more accurate from the semantic convention.~~ However, create_up_down_counter isn't asynchronous, and doesn't accept callbacks, so I choose to use ObservableGauge. ~~If you have better suggestions on how to fix the issue properly with UpDownCounter, please let me know.~~

I actually want to take back my words on using UpDownCounter. By checking the specification of UpDownCounter, it's more appropriate to use when the metrics are emitted using add(n) or add(-n). But in our case, we collect absolute values from snapshots.

I understand it deviates from the semantic conventions, but I also want to ask for a second thought based on the actual collection process.

bjrara avatar Sep 12 '24 15:09 bjrara

If there is already another language going out of spec we should open an issue in the semconv repo, no?

xrmx avatar Sep 24 '24 20:09 xrmx

Taking a second look, the issue makes sense. According to the specs, UpDownCounter is suitable because the metrics are additive and not monotonically increasing, but in practice, for our asynchronous measurements (connections, memory, thread_count, etc), the cumulative temporality of the ObservableUpDownCounter doesn't seem to fit in that situation. We want to report an absolute value (current snapshot) that is not monotonic and non-cumulative.

In java it's using ObservableGauge.

@bjrara, would you be open to adding a failing test showing how this is fixing the issue with the cumulative temporality?

There's also another issue: The specs do not reference the gc_count metric, but I totally agree with that change (from async counter to async gauge).

Thanks for your feedback. Sure, checking how to add a failing test on the issue.

bjrara avatar Sep 24 '24 22:09 bjrara

As discussed in SIG (09/26) -- @bjrara will try using the LastValue aggregation in a View to configure the SDK to output this metric as a gauge. If it doesn't work, we should bring the issue to the semconv SIG.

emdneto avatar Sep 26 '24 17:09 emdneto

As discussed in SIG (09/26) -- @bjrara will try using the LastValue aggregation in a View to configure the SDK to output this metric as a gauge. If it doesn't work, we should bring the issue to the semconv SIG.

The proposed solution works for me. Thanks to @emdneto and @aabmass for the suggestions. I will update the issue with the summary why we believe the existing instrument types are right, and how to work around to report metrics to systems that only accept delta.

bjrara avatar Sep 27 '24 17:09 bjrara