executorch icon indicating copy to clipboard operation
executorch copied to clipboard

Add event tracing and ETDumps to executor_runner

Open benkli01 opened this issue 1 year ago • 15 comments

  • Enabled via EXECUTORCH_ENABLE_EVENT_TRACER
  • Add flag 'etdump_path' to specify the file path for the ETDump file
  • Add flag 'num_executions' for number of iterations to run
  • Create and pass event tracer 'ETDumpGen'
  • Save ETDump to disk
  • Update docs to reflect the changes

Re-upload of https://github.com/pytorch/executorch/pull/4502 to discuss with @GregoryComer.

benkli01 avatar Sep 02 '24 11:09 benkli01

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5027

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 662cb81726972c7f0e305d2ed0d4adc344686076 with merge base 7bc06d1a22a19c2a9546e66aae550572935ffa27 (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Sep 02 '24 11:09 pytorch-bot[bot]

@pytorchbot label 'partner: arm'

benkli01 avatar Sep 02 '24 11:09 benkli01

@pytorchbot label ciflow/trunk

benkli01 avatar Sep 02 '24 11:09 benkli01

Can't add following labels to PR: ciflow/trunk. Please ping one of the reviewers for help.

pytorch-bot[bot] avatar Sep 02 '24 11:09 pytorch-bot[bot]

Hi @GregoryComer. Would it be possible to run the CI on your side to see if the issue from the previous PR is still occurring? I'm having a hard time understanding where this comes from.

benkli01 avatar Sep 04 '24 14:09 benkli01

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Sep 10 '24 02:09 facebook-github-bot

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Sep 12 '24 16:09 facebook-github-bot

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Sep 17 '24 14:09 facebook-github-bot

Hi @dbort . Sorry for dragging you into this, but I saw your comment on EXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT in the code, so I thought you might be able to help with resolving the failing test here. Any idea how to fix this?

benkli01 avatar Sep 23 '24 15:09 benkli01

I don't see a CI failure anymore

digantdesai avatar Sep 23 '24 15:09 digantdesai

I don't see a CI failure anymore

@digantdesai To me pull / test-llama-runner-qnn-linux (fp32, cmake, qnn) / linux-job (pull_request) is showing up as failing after my latest update. The CI run for the previous version you imported did not finish for me, i.e. I could not see any results, but it did not seem to have this test included anyway.

benkli01 avatar Sep 23 '24 15:09 benkli01

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Sep 24 '24 18:09 facebook-github-bot

Yeah I see, from the main CMakeList, and qnn does have EXECUTORCH_ENABLE_EVENT_TRACER=ON

digantdesai avatar Sep 24 '24 18:09 digantdesai

Any update on this?

digantdesai avatar Sep 30 '24 15:09 digantdesai

Hi @digantdesai, I'm still hoping for some pointer from @dbort or you as I'm struggling to reproduce it locally and can't really make sense of the error.

benkli01 avatar Sep 30 '24 16:09 benkli01

@digantdesai will you have a look at this one since it touches code outside arm delegate. Thx!

freddan80 avatar Nov 29 '24 07:11 freddan80

The error shows up when running this script https://github.com/pytorch/executorch/blob/main/backends/qualcomm/scripts/build.sh based on the log.

If you have a linux machine, can you follow https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html and see if the script fails?

cccclai avatar Nov 30 '24 18:11 cccclai

@cccclai I finally managed to reproduce the issue by running script backends/qualcomm/scripts/build.sh with the parameters from the CI script here. Interestingly, the issue seems to be caused by --job_number 2 (not my first guess). If I remove the parameter entirely, defaulting to --job_number 16, the issue disappears (not sure this is an acceptable solution and/or would work in the CI). I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.

benkli01 avatar Dec 02 '24 16:12 benkli01

@cccclai I finally managed to reproduce the issue by running script backends/qualcomm/scripts/build.sh with the parameters from the CI script here. Interestingly, the issue seems to be caused by --job_number 2 (not my first guess). If I remove the parameter entirely, defaulting to --job_number 16, the issue disappears (not sure this is an acceptable solution and/or would work in the CI). I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.

Ah I remember that. @dbort and @Olivia-liu, any thought on this?

cccclai avatar Dec 04 '24 17:12 cccclai

Also for what it's worth, I'm trying to merge https://github.com/dvidelabs/flatcc/pull/306 into upstream flatcc to let us remove -DEXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT and similar hacks

dbort avatar Dec 05 '24 01:12 dbort

gmake[2]: *** No rule to make target '../third-party/flatcc/lib/libflatccrt.a', needed by 'executor_runner'.  Stop.

Thanks @dbort , this is the error exactly.

I'm not sure about the workaround to use release mode. The command used in the ci script is already using --release. As mentioned above, removing --num_jobs 2 seems to work locally, but it's a strange workaround and might not work in CI.

benkli01 avatar Dec 05 '24 16:12 benkli01

I'm not sure about the workaround to use release mode. The command used in the ci script is already using --release.

Ok, thanks for looking into that.

As mentioned above, removing --num_jobs 2 seems to work locally, but it's a strange workaround and might not work in CI.

That means that there's some kind of race condition.

Based on your PR, it looks like executor_runner has a proper dependency on libflatccrt; otherwise I would have expected a use-without-dependency situation.

...except maybe it doesn't. This PR adds a dep on "${FLATCCRT_LIB}" from the top-level CMakeLists.txt, but when I look for a place that sets FLATCCRT_LIB I only see it in the cmake config file at https://github.com/pytorch/executorch/blob/538bfaf3d5a25baf2d094c178b0fe9e708ed16f2/build/executorch-config.cmake#L51-L55

afaik, that config isn't included in the top-level cmake system. That file is used to point to an already-built version of the core ET libs from external projects, like https://github.com/pytorch/executorch/blob/538bfaf3d5a25baf2d094c178b0fe9e708ed16f2/examples/devtools/CMakeLists.txt#L39

If FLATCCRT_LIB is empty in this PR at https://github.com/pytorch/executorch/pull/5027/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR817-R819

list(APPEND _executor_runner_libs etdump ${FLATCCRT_LIB})

then executor_runner wouldn't properly depend on libflatccrt.a. A parallel build could cause that lib to be coincidentally built earlier with -j16, "fixing" the problem, while -j2 would be less likely to do so.

Could you try printing the value of FLATCCRT_LIB from the top-level CMakeLists.txt to see if it's empty?

dbort avatar Dec 07 '24 01:12 dbort

Though theoretically executor_runner shouldn't even need to know about libflatccrt: it should inherit the dep from the PUBLIC section of

https://github.com/pytorch/executorch/blob/538bfaf3d5a25baf2d094c178b0fe9e708ed16f2/devtools/CMakeLists.txt#L179-L183

But in this case, you could try updating this PR to use flatccrt as the dep instead of using ${FLATCCRT_LIB}. Even if FLATCCRT_LIB were defined, I think it's actually wrong to use it as the dep name -- I believe that the target is always called flatccrt even if, in debug mode, the file that it generates is called libflatccrt_d.a.

dbort avatar Dec 07 '24 01:12 dbort

Hi @dbort . I tried fixing the flatccrt dependency as suggested, but without any effect:

  1. Replace ${FLATCCRT_LIB} with flatccrt
  2. Remove flatccrt dependency completely and rely on inherited dependency from etdump

(I did this both in CMakeLists.txt and examples/qualcomm/executor_runner/CMakeLists.txt)

I did find a new workaround though, which should be more stable than just removing the --num_jobs 2: Run the command to build the QNN SDK twice, first clean then without cleaning in file .ci/scripts/build-qnn-sdk.sh.

I feel like the flatccrt issue is not related to my change so I will open an issue for it. I can push the above workaround in a separate PR. I will be off from tomorrow until next year, but I really hope we can find a solution together to get this PR merged.

benkli01 avatar Dec 12 '24 12:12 benkli01

@dbort has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jan 22 '25 23:01 facebook-github-bot

Thanks for updating the dependency, and again I apologize for how long this has taken me to review. I'm running internal tests now and should be able to merge this soon.

dbort avatar Jan 22 '25 23:01 dbort

Thanks @dbort ! There is a linker error remaining in the qnn tests that I could not reproduce locally and I don't understand where it is coming from. Maybe you have an idea...

benkli01 avatar Jan 23 '25 10:01 benkli01

Thanks @dbort ! I managed to find and fix the issue by following your pointers a little further. I think the remaining CI failures are not related to my change.

The problem was that the QNN backend always set the definition ET_EVENT_TRACER_ENABLED without taking into account the status of CMake flag EXECUTORCH_ENABLE_EVENT_TRACER. I.e. when building with EXECUTORCH_BUILD_QNN=ON the definition ET_EVENT_TRACER_ENABLED was always enabled at compile time, but the needed libs might not have been linked depending on the CMake flags.

The fix: I removed the definition in the QNN backend here so that the behavior should now be fully controlled by the CMake flags.

benkli01 avatar Jan 24 '25 15:01 benkli01

@benkli01 Thank you for tracking down the problem with the Qualcomm jobs! @shewu-quic @cccclai please let us know if the change to qualcomm/CMakeLists.txt is ok.

dbort avatar Jan 28 '25 23:01 dbort

@dbort has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jan 28 '25 23:01 facebook-github-bot