omnitrace icon indicating copy to clipboard operation
omnitrace copied to clipboard

Omnitrace hangs during post-processing

Open jordap opened this issue 3 years ago • 3 comments

I'm experiencing some issues when profiling a Python application. The code I'm running is the language modeling example available in the ROCmSoftwarePlatform/transformers repository. It's executed as follows:

python run_mlm.py --model_name_or_path bert-large-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --logging_steps 1 --output_dir /tmp/test-mlm-bbu --overwrite_output_dir --per_device_train_batch_size 8 --fp16 --skip_memory_metrics=True --cache_dir /tmp/bert_cache --max_steps 160 

(The first time it's executed, it will download and cache input datasets from public sources.)

One key parameter is --max-steps. In shorter executions with a smaller number of steps (<= 16), Omnitrace seems to work just fine. In longer executions when the value is higher (>= 160), Omnitrace gets stuck during post-processing.

I'm running Omnitrace with Timemory, and disabling process sampling:

OMNITRACE_USE_PROCESS_SAMPLING=OFF OMNITRACE_USE_TIMEMORY=ON python3 -m omnitrace -- run_mlm.py [...] --max_steps 160

Sample Omnitrace log available. These are the last lines I see printed to stderr (edited):

[...]
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:124@'operator()']> [wall_clock]> merging 10816 hash-aliases into existing set of 19625 hash-aliases!...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:174@'merge']> [wall_clock]> worker is merging 1 records into 62190 records...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:223@'merge']> wall_clock master has 62191 records...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:318@'merge']> [wall_clock]> clearing merged storage!...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:151@'~storage']> [tim::component::wall_clock|3]> destroying storage...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:162@'~storage']> [tim::component::wall_clock|3]> merging into primary instance...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:90@'merge']> [wall_clock]> merging rhs=1 into lhs=62191...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:105@'operator()']> [wall_clock]> merging 7207 hash-ids into existing set of 10449 hash-ids!...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:124@'operator()']> [wall_clock]> merging 10816 hash-aliases into existing set of 19625 hash-aliases!...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:174@'merge']> [wall_clock]> worker is merging 1 records into 62191 records...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:223@'merge']> wall_clock master has 62192 records...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:318@'merge']> [wall_clock]> clearing merged storage!...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:151@'~storage']> [tim::component::wall_clock|4]> destroying storage
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:180@'~storage']> [tim::component::wall_clock|3]> deleting graph data...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:187@'~storage']> [tim::component::wall_clock|3]> storage destroyed...
[pid=13003][tid=3][timemory/source/timemory/storage/base_storage.cpp:103@'~storage']> base::storage instance 3 deleted for tim::component::wall_clock...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:180@'~storage']> [tim::component::wall_clock|4]> deleting graph data...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:187@'~storage']> [tim::component::wall_clock|4]> storage destroyed...
[pid=13003][tid=4][timemory/source/timemory/storage/base_storage.cpp:103@'~storage']> base::storage instance 4 deleted for tim::component::wall_clock...

jordap avatar Sep 28 '22 21:09 jordap

What version of omnitrace?

jrmadsen avatar Sep 28 '22 21:09 jrmadsen

Omnitrace 1.6.0. Installed with the DEB package for Ubuntu 20.04: omnitrace_1.6.0-ubuntu20.04.ROCm50200.PAPI.OMPT.Python3_amd64.deb.

jordap avatar Sep 29 '22 00:09 jordap

I tried to do more selective profiling of a single function. This function includes the main loop and all GPU-related calls. As expected, this makes the output smaller: the wall_clock has a couple thousand lines instead of tens of thousands.

But I keep having the same issues with OMNITRACE_USE_PROCESS_SAMPLING=OFF. I only manage to get traces with OMNITRACE_USE_ROCM_SMI=OFF.

jordap avatar Sep 30 '22 16:09 jordap

I've figure out the source of why it is taking so long to post-process the rocm-smi data. I'll include it in the next release.

jrmadsen avatar Oct 20 '22 18:10 jrmadsen