dd-trace-py icon indicating copy to clipboard operation
dd-trace-py copied to clipboard

chore(internal): improve performance of telemetry dependency reporting

Open brettlangdon opened this issue 1 year ago • 4 comments

This change:

  • Removes the usage of TelemetryModuleWatchdog
    • A strategy of diffing sys.modules.keys() is now used instead of watching for imports.
  • Removes tracking of module origin
    • Since we are only interested in distribution name + version we don't need to keep track of all imported module origins, and then look up distributions from that. We can use import names to directly look up the distribution they are apart of.
  • Add support for namespace packages
    • Added the use of importlib_metdata.packages_distributions() to ensure we are able to resolve namespace packages to their actual distribution name.
    • We only add importlib_metadata for Python <3.8, but packages_distributions was added in 3.10, the footprint is fairly small, so it is just vendored for the cases when we don't have it available (keeps importlib_metadata requirement fairly broad as well)

The main change here is to have the TelemetryWriter keep track of which sys.modules.keys() we have already resolved and sent information about. Then on each periodic run, compare current sys.modules.keys() against the cache, and send any new modules.

We also are no longer using origins to try and resolve the distributions for a given module, instead we use importlib_metadata.distribution(name), if that fails to resolve, then we will check to see if that import name is mapped to any namespace packages, which we then try grab the distributions for that. (this means any given module name could resolve to multiple distributions to report about).

I have really struggled to get a solid benchmark/performance measurement going, the only details I have are the internal app I was testing against went from 100s to about 2s of overhead added for dependency resolution, and the test suite itself still validates that the previous behavior is occurring.

Checklist

  • [x] Change(s) are motivated and described in the PR description
  • [x] Testing strategy is described if automated tests are not included in the PR
  • [x] Risks are described (performance impact, potential for breakage, maintainability)
  • [x] Change is maintainable (easy to change, telemetry, documentation)
  • [x] Library release note guidelines are followed or label changelog/no-changelog is set
  • [x] Documentation is included (in-code, generated user docs, public corp docs)
  • [x] Backport labels are set (if applicable)
  • [x] If this PR changes the public interface, I've notified @DataDog/apm-tees.

Reviewer Checklist

  • [x] Title is accurate
  • [x] All changes are related to the pull request's stated goal
  • [x] Description motivates each change
  • [x] Avoids breaking API changes
  • [x] Testing strategy adequately addresses listed risks
  • [x] Change is maintainable (easy to change, telemetry, documentation)
  • [x] Release note makes sense to a user of the library
  • [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • [x] Backport labels are set in a manner that is consistent with the release branch maintenance policy

brettlangdon avatar Jun 11 '24 19:06 brettlangdon

Datadog Report

Branch report: brettlangdon/deps.perf Commit report: bb4d787 Test service: dd-trace-py

:white_check_mark: 0 Failed, 177112 Passed, 1882 Skipped, 10h 54m 12.29s Total duration (19m 51.06s time saved) :snowflake: 1 New Flaky :hourglass: 2 Performance Regressions

New Flaky Tests (1)

  • test_connector[http] - test_utils_http.py - Last Failure

    Expand for error
    Errno 99] Cannot assign requested address
    

:hourglass: Performance Regressions vs Default Branch (2)

  • test_tracer_usage_multiprocess - test_rand.py 7m 1.49s (+6m 57.58s, +10696%) - Details
  • test_logger_handle_debug - test_logger.py 2.74s (+2.24s, +447%) - Details

Codecov Report

Attention: Patch coverage is 0% with 144 lines in your changes missing coverage. Please review.

Project coverage is 10.51%. Comparing base (c728c68) to head (cb1e56d). Report is 16 commits behind head on main.

Files Patch % Lines
ddtrace/internal/packages.py 0.00% 57 Missing :warning:
ddtrace/internal/telemetry/modules.py 0.00% 56 Missing :warning:
ddtrace/internal/telemetry/data.py 0.00% 12 Missing :warning:
tests/telemetry/test_writer.py 0.00% 9 Missing :warning:
ddtrace/internal/telemetry/writer.py 0.00% 6 Missing :warning:
tests/telemetry/test_data.py 0.00% 4 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #9515       +/-   ##
===========================================
- Coverage   74.30%   10.51%   -63.79%     
===========================================
  Files        1398     1367       -31     
  Lines      129930   127823     -2107     
===========================================
- Hits        96541    13439    -83102     
- Misses      33389   114384    +80995     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jun 12 '24 07:06 codecov-commenter

Benchmarks

Benchmark execution time: 2024-08-27 14:29:13

Comparing candidate commit 857676eb30dd6f5bef77e01d6ef3e0a5bc19e5cd in PR branch brettlangdon/deps.perf with baseline commit 53443006f51c26d6984929915112ee14503b7be8 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 214 metrics, 2 unstable metrics.

pr-commenter[bot] avatar Jun 12 '24 13:06 pr-commenter[bot]

I guess this was some CI hang, but could you double check, @brettlangdon ?

test_tracer_usage_multiprocess - test_rand.py 7m 1.49s (+6m 57.58s, +10696%) - Details

gnufede avatar Jun 13 '24 13:06 gnufede

Datadog Report

Branch report: brettlangdon/deps.perf Commit report: 857676e Test service: dd-trace-py

:white_check_mark: 0 Failed, 142960 Passed, 1670 Skipped, 8h 37m 58.23s Total duration (14.62s time saved) :hourglass: 1 Performance Regression

:hourglass: Performance Regressions vs Default Branch (1)

  • test_data_streams_payload_size[key_and_length2-payload_and_length2] - test_kafka.py 45s (+41.97s, +1383%) - Details

CODEOWNERS have been resolved as:

ddtrace/internal/telemetry/modules.py                                   @DataDog/apm-core-python
ddtrace/internal/packages.py                                            @DataDog/apm-core-python
ddtrace/internal/telemetry/data.py                                      @DataDog/apm-core-python
ddtrace/internal/telemetry/writer.py                                    @DataDog/apm-core-python
tests/telemetry/test_data.py                                            @DataDog/apm-core-python
tests/telemetry/test_writer.py                                          @DataDog/apm-core-python

github-actions[bot] avatar Jul 16 '24 13:07 github-actions[bot]