Add event tracing and ETDumps to executor_runner
- Enabled via EXECUTORCH_ENABLE_EVENT_TRACER
- Add flag 'etdump_path' to specify the file path for the ETDump file
- Add flag 'num_executions' for number of iterations to run
- Create and pass event tracer 'ETDumpGen'
- Save ETDump to disk
- Update docs to reflect the changes
Re-upload of https://github.com/pytorch/executorch/pull/4502 to discuss with @GregoryComer.
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5027
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:white_check_mark: No Failures
As of commit 662cb81726972c7f0e305d2ed0d4adc344686076 with merge base 7bc06d1a22a19c2a9546e66aae550572935ffa27 ():
:green_heart: Looks good so far! There are no failures yet. :green_heart:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label 'partner: arm'
@pytorchbot label ciflow/trunk
Can't add following labels to PR: ciflow/trunk. Please ping one of the reviewers for help.
Hi @GregoryComer. Would it be possible to run the CI on your side to see if the issue from the previous PR is still occurring? I'm having a hard time understanding where this comes from.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @dbort . Sorry for dragging you into this, but I saw your comment on EXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT in the code, so I thought you might be able to help with resolving the failing test here. Any idea how to fix this?
I don't see a CI failure anymore
I don't see a CI failure anymore
@digantdesai To me pull / test-llama-runner-qnn-linux (fp32, cmake, qnn) / linux-job (pull_request) is showing up as failing after my latest update. The CI run for the previous version you imported did not finish for me, i.e. I could not see any results, but it did not seem to have this test included anyway.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Yeah I see, from the main CMakeList, and qnn does have EXECUTORCH_ENABLE_EVENT_TRACER=ON
Any update on this?
Hi @digantdesai, I'm still hoping for some pointer from @dbort or you as I'm struggling to reproduce it locally and can't really make sense of the error.
@digantdesai will you have a look at this one since it touches code outside arm delegate. Thx!
The error shows up when running this script https://github.com/pytorch/executorch/blob/main/backends/qualcomm/scripts/build.sh based on the log.
If you have a linux machine, can you follow https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html and see if the script fails?
@cccclai I finally managed to reproduce the issue by running script backends/qualcomm/scripts/build.sh with the parameters from the CI script here. Interestingly, the issue seems to be caused by --job_number 2 (not my first guess). If I remove the parameter entirely, defaulting to --job_number 16, the issue disappears (not sure this is an acceptable solution and/or would work in the CI). I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.
@cccclai I finally managed to reproduce the issue by running script
backends/qualcomm/scripts/build.shwith the parameters from the CI script here. Interestingly, the issue seems to be caused by--job_number 2(not my first guess). If I remove the parameter entirely, defaulting to--job_number 16, the issue disappears (not sure this is an acceptable solution and/or would work in the CI). I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.
Ah I remember that. @dbort and @Olivia-liu, any thought on this?
Also for what it's worth, I'm trying to merge https://github.com/dvidelabs/flatcc/pull/306 into upstream flatcc to let us remove -DEXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT and similar hacks
gmake[2]: *** No rule to make target '../third-party/flatcc/lib/libflatccrt.a', needed by 'executor_runner'. Stop.
Thanks @dbort , this is the error exactly.
I'm not sure about the workaround to use release mode. The command used in the ci script is already using --release. As mentioned above, removing --num_jobs 2 seems to work locally, but it's a strange workaround and might not work in CI.
I'm not sure about the workaround to use release mode. The command used in the ci script is already using --release.
Ok, thanks for looking into that.
As mentioned above, removing --num_jobs 2 seems to work locally, but it's a strange workaround and might not work in CI.
That means that there's some kind of race condition.
Based on your PR, it looks like executor_runner has a proper dependency on libflatccrt; otherwise I would have expected a use-without-dependency situation.
...except maybe it doesn't. This PR adds a dep on "${FLATCCRT_LIB}" from the top-level CMakeLists.txt, but when I look for a place that sets FLATCCRT_LIB I only see it in the cmake config file at https://github.com/pytorch/executorch/blob/538bfaf3d5a25baf2d094c178b0fe9e708ed16f2/build/executorch-config.cmake#L51-L55
afaik, that config isn't included in the top-level cmake system. That file is used to point to an already-built version of the core ET libs from external projects, like https://github.com/pytorch/executorch/blob/538bfaf3d5a25baf2d094c178b0fe9e708ed16f2/examples/devtools/CMakeLists.txt#L39
If FLATCCRT_LIB is empty in this PR at https://github.com/pytorch/executorch/pull/5027/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR817-R819
list(APPEND _executor_runner_libs etdump ${FLATCCRT_LIB})
then executor_runner wouldn't properly depend on libflatccrt.a. A parallel build could cause that lib to be coincidentally built earlier with -j16, "fixing" the problem, while -j2 would be less likely to do so.
Could you try printing the value of FLATCCRT_LIB from the top-level CMakeLists.txt to see if it's empty?
Though theoretically executor_runner shouldn't even need to know about libflatccrt: it should inherit the dep from the PUBLIC section of
https://github.com/pytorch/executorch/blob/538bfaf3d5a25baf2d094c178b0fe9e708ed16f2/devtools/CMakeLists.txt#L179-L183
But in this case, you could try updating this PR to use flatccrt as the dep instead of using ${FLATCCRT_LIB}. Even if FLATCCRT_LIB were defined, I think it's actually wrong to use it as the dep name -- I believe that the target is always called flatccrt even if, in debug mode, the file that it generates is called libflatccrt_d.a.
Hi @dbort . I tried fixing the flatccrt dependency as suggested, but without any effect:
- Replace
${FLATCCRT_LIB}withflatccrt - Remove
flatccrtdependency completely and rely on inherited dependency frometdump
(I did this both in CMakeLists.txt and examples/qualcomm/executor_runner/CMakeLists.txt)
I did find a new workaround though, which should be more stable than just removing the --num_jobs 2:
Run the command to build the QNN SDK twice, first clean then without cleaning in file .ci/scripts/build-qnn-sdk.sh.
I feel like the flatccrt issue is not related to my change so I will open an issue for it. I can push the above workaround in a separate PR. I will be off from tomorrow until next year, but I really hope we can find a solution together to get this PR merged.
@dbort has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks for updating the dependency, and again I apologize for how long this has taken me to review. I'm running internal tests now and should be able to merge this soon.
Thanks @dbort ! There is a linker error remaining in the qnn tests that I could not reproduce locally and I don't understand where it is coming from. Maybe you have an idea...
Thanks @dbort ! I managed to find and fix the issue by following your pointers a little further. I think the remaining CI failures are not related to my change.
The problem was that the QNN backend always set the definition ET_EVENT_TRACER_ENABLED without taking into account the status of CMake flag EXECUTORCH_ENABLE_EVENT_TRACER. I.e. when building with EXECUTORCH_BUILD_QNN=ON the definition ET_EVENT_TRACER_ENABLED was always enabled at compile time, but the needed libs might not have been linked depending on the CMake flags.
The fix: I removed the definition in the QNN backend here so that the behavior should now be fully controlled by the CMake flags.
@benkli01 Thank you for tracking down the problem with the Qualcomm jobs! @shewu-quic @cccclai please let us know if the change to qualcomm/CMakeLists.txt is ok.
@dbort has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.