feat(profiling): add support for pytorch profiling

Open sanchda opened this issue 1 year ago • 1 comments

PR does

  • Patches torch.profiler.profile class by adding our own on_trace_ready handler
  • Adds GPU time/flops/memory samples via libdatadog interface in on_trace_ready event handler
  • Ensures that libdd exporter is enabled if pytorch is enabled
  • Adds a changelog entry
  • Is there a minimum Python version?
    • The main constraint is the torch version rather than the Python version: the PyTorch profiler API that we instrument was introduced in torch 1.9 (https://pytorch.org/blog/pytorch-1.9-released/). Do we just want to document this, or should we disable the instrumentation when torch.__version__ reports an older version?
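
The two ideas above (gating on the torch version and adding our own on_trace_ready handler) could look roughly like the sketch below. This is an illustration only, not the code in this PR: `MIN_TORCH_VERSION`, `torch_profiler_supported`, and `chain_on_trace_ready` are hypothetical names, and the sketch deliberately avoids importing torch so it runs anywhere. It assumes the version string starts with `major.minor`, as in `1.9.0` or `2.1.0+cu121`.

```python
import logging
import re
from typing import Any, Callable, Optional, Tuple

log = logging.getLogger(__name__)

# Hypothetical constant: the PR description says the instrumented profiler API
# was introduced in torch 1.9.
MIN_TORCH_VERSION = (1, 9)


def parse_major_minor(version: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from strings such as '2.1.0+cu121' or '1.9.0a0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def torch_profiler_supported(version: str) -> bool:
    """Return True if the given torch version is at least MIN_TORCH_VERSION."""
    parsed = parse_major_minor(version)
    return parsed is not None and parsed >= MIN_TORCH_VERSION


def chain_on_trace_ready(
    user_handler: Optional[Callable[[Any], None]],
    collector_handler: Callable[[Any], None],
) -> Callable[[Any], None]:
    """Build a handler that runs the collector first, then the user's handler.

    Errors in the collector are logged and swallowed so that instrumentation
    cannot prevent the user's own handler from running.
    """

    def handler(prof: Any) -> None:
        try:
            collector_handler(prof)
        except Exception:
            log.exception("collector on_trace_ready handler failed")
        if user_handler is not None:
            user_handler(prof)

    return handler
```

In a real integration the check would be fed `str(torch.__version__)`, and the chained handler would be passed as the `on_trace_ready` argument of `torch.profiler.profile`. Running the collector before the user's handler is a design choice made for this sketch, not something the PR description specifies.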

Still need

  • Experimental/beta collectors should probably be excluded from the ALL template
  • Documentation on required user configuration, conflicting features, and gotchas

Checklist

  • [x] Change(s) are motivated and described in the PR description
  • [x] Testing strategy is described if automated tests are not included in the PR
  • [x] Risks are described (performance impact, potential for breakage, maintainability)
  • [x] Change is maintainable (easy to change, telemetry, documentation)
  • [x] Library release note guidelines are followed or label changelog/no-changelog is set
  • [x] Documentation is included (in-code, generated user docs, public corp docs)
  • [x] Backport labels are set (if applicable)
  • [x] If this PR changes the public interface, I've notified @DataDog/apm-tees.

Reviewer Checklist

  • [ ] Title is accurate
  • [ ] All changes are related to the pull request's stated goal
  • [ ] Description motivates each change
  • [ ] Avoids breaking API changes
  • [ ] Testing strategy adequately addresses listed risks
  • [ ] Change is maintainable (easy to change, telemetry, documentation)
  • [ ] Release note makes sense to a user of the library
  • [ ] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • [ ] Backport labels are set in a manner that is consistent with the release branch maintenance policy

sanchda avatar May 03 '24 14:05 sanchda

Benchmarks

Benchmark execution time: 2024-12-13 22:37:43

Comparing candidate commit 939321620d0038de3709bef2722e96539737cd64 in PR branch peterg17/pytorch_profiling_integration2 with baseline commit 1dd528c0ef5d04e2b095f61d8a15e8fc15cbb00a in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

pr-commenter[bot] avatar May 21 '24 15:05 pr-commenter[bot]

Datadog Report

Branch report: peterg17/pytorch_profiling_integration2 Commit report: 28ed224 Test service: dd-trace-py

:white_check_mark: 0 Failed, 389 Passed, 1219 Skipped, 44m 19.55s Total duration (42m 52.03s time saved)

This pull request has been automatically closed after a period of inactivity. After this much time, it will likely be easier to open a new pull request with the same changes than to update this one from the base branch. Please comment or reopen if you think this pull request was closed in error.

github-actions[bot] avatar Oct 24 '24 00:10 github-actions[bot]

CODEOWNERS have been resolved as:

.github/workflows/pytorch_gpu_tests.yml                                 @DataDog/python-guild @DataDog/apm-core-python
ddtrace/profiling/collector/pytorch.py                                  @DataDog/profiling-python
docs/pytorch_metric.png                                                 @DataDog/python-guild
releasenotes/notes/profiling-add-pytorch-integration-0683123b7bb83f99.yaml  @DataDog/apm-python
tests/profiling_v2/simple_program_pytorch_gpu.py                        @DataDog/profiling-python
tests/profiling_v2/test_pytorch.py                                      @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/include/ddup_interface.hpp  @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/include/libdatadog_helpers.hpp  @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/include/sample.hpp        @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/include/types.hpp         @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/ddup_interface.cpp    @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/profile.cpp           @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/sample.cpp            @DataDog/profiling-python
ddtrace/internal/datadog/profiling/ddup/_ddup.pyi                       @DataDog/profiling-python
ddtrace/internal/datadog/profiling/ddup/_ddup.pyx                       @DataDog/profiling-python
ddtrace/profiling/profiler.py                                           @DataDog/profiling-python
ddtrace/settings/profiling.py                                           @DataDog/profiling-python
docs/advanced_usage.rst                                                 @DataDog/python-guild
docs/spelling_wordlist.txt                                              @DataDog/python-guild
hatch.toml                                                              @DataDog/python-guild

github-actions[bot] avatar Dec 09 '24 21:12 github-actions[bot]

Datadog Report

Branch report: peterg17/pytorch_profiling_integration2 Commit report: a4492b7 Test service: dd-trace-py

:white_check_mark: 0 Failed, 769 Passed, 699 Skipped, 15m 20.28s Total duration (58m 16.46s time saved)