
Use probability sampling over periodic sampling

Open vlkale opened this issue 2 years ago • 6 comments

Fixes #180.

This PR is related to the old PR #181.

The current Kokkos sampler utility uses periodic sampling, controlled by a sampler skip rate. This is often restrictive when sampling profiling and debugging data: for example, it can miss important data that does not fall on the period of the kernel invocations. The goal of this PR is to let users of the Kokkos sampler utility use random sampling primarily and periodic sampling secondarily, via environment variables of the form KOKKOS_TOOLS_SAMPLER_xyz.

The solution should not allow a combination of periodicity and probability, so when both are set, the probability is always chosen and the skip rate is ignored.

For example, suppose a user requests sampling of a Kokkos::parallel_for() on every 20th invocation and also requests that the time spent in that Kokkos::parallel_for() be gathered with probability 63% on each invocation. Then the sampler does not skip any invocations of the Kokkos::parallel_for() on a periodic basis; instead, it obtains a timing with probability 63% on each invocation.
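To make the precedence rule concrete, here is a minimal sketch of the per-invocation decision. It is not the code in this PR: the SamplerConfig struct, the should_sample function, the "negative means not set" convention, and the use of std::mt19937_64 are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical configuration; the field names are illustrative only.
struct SamplerConfig {
  double probability_percent;  // negative means "no probability was set"
  std::uint64_t skip_rate;     // periodic mode: sample every skip_rate-th invocation
};

// Returns true if the invocation with this 1-based index should be
// forwarded to the child tool.
bool should_sample(const SamplerConfig& cfg, std::uint64_t invocation,
                   std::mt19937_64& rng) {
  assert(cfg.skip_rate >= 1);
  if (cfg.probability_percent >= 0.0) {
    // A probability was set: it takes precedence and the skip rate is ignored.
    if (cfg.probability_percent >= 100.0) return true;
    if (cfg.probability_percent <= 0.0) return false;
    std::uniform_real_distribution<double> percent(0.0, 100.0);
    return percent(rng) < cfg.probability_percent;  // independent draw per invocation
  }
  // No probability set: fall back to periodic sampling.
  return invocation % cfg.skip_rate == 0;
}
```

With a configuration of {63.0, 20}, the skip rate of 20 never influences the result; each invocation is decided by its own random draw.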

See common/kokkos-sampler/README.md for a high-level overview, in English, of the changes.

For later: I will add a slide on sampling and filtering to the Kokkos Tools tutorial slides to explain how to use these utilities.

vlkale avatar Oct 13 '23 02:10 vlkale

The following two outputs show that sampling with probability 1.0% works properly when the sampler is applied to the kernel timer Kokkos tool for the stream benchmark in the Kokkos core benchmark folder.

The first output is with Kokkos Tools global fences on (tool-induced fencing is enabled) and the second is with global fencing turned off. In the second case, fencing is not invoked, as expected. Also note that the set of sampled Kokkos kernel invocation numbers differs between the two runs: the random number generator is seeded with the current time, so different invocations are sampled. The following was run on macOS with gcc and Kokkos 4.1.

vlkale@s1088602ca stream % export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_SAMPLER_PROB=1.0; export KOKKOS_TOOLS_GLOBALFENCES=1; export KOKKOS_TOOLS_LIBS="/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/common/kokkos-sampler/kp_sampler.so;/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so"; ./stream.exe 
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so
KokkosP: Loading child library ..
KokkosP: Simple Kernel Timer Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       yes
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 20
KokkosP: Sampling rate provided as input: 20
KokkosP: Sampling probability provided as input: 1.0
KokkosP: Sampling rate set to: 21
KokkosP: Sampling probability set to 1.000000
KokkosP: seeding Random Number Generator using clock for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 200 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 12 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 12 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 267 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 267 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 362 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 362 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 468 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 468 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 503 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 503 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 579 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 579 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 657 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 657 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 925 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: sample 925 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                62629.87 MB/s
Copy               74669.24 MB/s
Scale              74154.04 MB/s
Add                83099.73 MB/s
Triad              82674.88 MB/s
-------------------------------------------------------------
KokkosP: Kernel timing written to /Users/vlkale/Desktop/vlap/wk/code/softwareTech/kokkos/benchmarks/stream/s1088602ca-41172.dat 
vlkale@s1088602ca stream % export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_SAMPLER_PROB=1.0; export KOKKOS_TOOLS_GLOBALFENCES=0; export KOKKOS_TOOLS_LIBS="/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/common/kokkos-sampler/kp_sampler.so;/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so"; ./stream.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so
KokkosP: Loading child library ..
KokkosP: Simple Kernel Timer Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       yes
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 20
KokkosP: Sampling rate provided as input: 20
KokkosP: Sampling probability provided as input: 1.0
KokkosP: Sampling rate set to: 21
KokkosP: Sampling probability set to 1.000000
KokkosP: seeding Random Number Generator using clock for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 200 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 28 calling child-begin function...
KokkosP: sample 28 calling child-end function...
KokkosP: sample 296 calling child-begin function...
KokkosP: sample 296 calling child-end function...
KokkosP: sample 370 calling child-begin function...
KokkosP: sample 370 calling child-end function...
KokkosP: sample 377 calling child-begin function...
KokkosP: sample 377 calling child-end function...
KokkosP: sample 476 calling child-begin function...
KokkosP: sample 476 calling child-end function...
KokkosP: sample 503 calling child-begin function...
KokkosP: sample 503 calling child-end function...
KokkosP: sample 601 calling child-begin function...
KokkosP: sample 601 calling child-end function...
KokkosP: sample 693 calling child-begin function...
KokkosP: sample 693 calling child-end function...
KokkosP: sample 944 calling child-begin function...
KokkosP: sample 944 calling child-end function...
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                62633.34 MB/s
Copy               74470.13 MB/s
Scale              74797.96 MB/s
Add                83071.21 MB/s
Triad              82997.36 MB/s
-------------------------------------------------------------
KokkosP: Kernel timing written to /Users/vlkale/Desktop/vlap/wk/code/softwareTech/kokkos/benchmarks/stream/s1088602ca-41194.dat 
vlkale@s1088602ca stream %

vlkale avatar Oct 13 '23 02:10 vlkale

Below is a test of the most recently committed version with KOKKOS_TOOLS_SEED set, as requested by @crtrott. Two separate runs of the stream program were done with the following environment variables set. Both runs sample the same sequence of events, showing that the manual seed, rather than a time-generated seed, is working for this case. Note that the output is truncated so that it is easier to view within this GitHub issue.

ViveksMacBook: stream % export KOKKOS_TOOLS_LIBS="/Users/vivek/kto-inst/libkp_kokkos_sampler.dylib;/Users/Vivek/kto-inst/libkp_kernel_logger.dylib"; export KOKKOS_TOOLS_SEED=4; export KOKKOS_TOOLS_SEED=4; export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_SAMPLER_PROBABILITY=50.0; ./stream.exe;
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/vivek/kto-inst/libkp_kernel_logger.dylib
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       no
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 1
KokkosP: Sampling rate provided as input: 1
KokkosP: Sampling probability provided as input: 50.0
KokkosP: Sampling rate set to: 2
KokkosP: Sampling probability set to 50.000000
KokkosP: Seeding random number generator using seed 4 for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
KokkosP: sample 1 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP:     Kokkos::View::initialization [a] via memset
KokkosP: sample 1 finished with child-begin function.
KokkosP: sample 1 calling child-end function...
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 3 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP:     Kokkos::View::initialization [c] via memset
KokkosP: sample 3 finished with child-begin function.
KokkosP: sample 3 calling child-end function...
KokkosP: Execution of kernel 1 is completed.
Initializing Views...
Starting benchmarking...
KokkosP: sample 5 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP:     set
KokkosP: sample 5 finished with child-begin function.
KokkosP: sample 5 calling child-end function...
KokkosP: Execution of kernel 2 is completed.

...

KokkosP: sample 96 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 47
KokkosP:     copy
KokkosP: sample 96 finished with child-begin function.
KokkosP: sample 96 calling child-end function...
KokkosP: Execution of kernel 47 is completed.
KokkosP: sample 98 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 48
KokkosP:     add
KokkosP: sample 98 finished with child-begin function.
KokkosP: sample 98 calling child-end function...
KokkosP: Execution of kernel 48 is completed.
KokkosP: sample 99 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 49
KokkosP:     triad
KokkosP: sample 99 finished with child-begin function.
KokkosP: sample 99 calling child-end function...
KokkosP: Execution of kernel 49 is completed.
KokkosP: sample 101 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 50
KokkosP:     copy
KokkosP: sample 101 finished with child-begin function.
KokkosP: sample 101 calling child-end function...
KokkosP: Execution of kernel 50 is completed.
KokkosP: sample 103 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 51
KokkosP:     add
KokkosP: sample 103 finished with child-begin function.
KokkosP: sample 103 calling child-end function...
KokkosP: Execution of kernel 51 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                12593.03 MB/s
Copy               16127.57 MB/s
Scale              15996.91 MB/s
Add                16634.53 MB/s
Triad              17182.65 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
ViveksMacBook stream % ./stream.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/Vivek/kto-inst/libkp_kernel_logger.dylib
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       no
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 1
KokkosP: Sampling rate provided as input: 1
KokkosP: Sampling probability provided as input: 50.0
KokkosP: Sampling rate set to: 2
KokkosP: Sampling probability set to 50.000000
KokkosP: Seeding random number generator using seed 4 for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
KokkosP: sample 1 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP:     Kokkos::View::initialization [a] via memset
KokkosP: sample 1 finished with child-begin function.
KokkosP: sample 1 calling child-end function...
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 3 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP:     Kokkos::View::initialization [c] via memset
KokkosP: sample 3 finished with child-begin function.
KokkosP: sample 3 calling child-end function...
KokkosP: Execution of kernel 1 is completed.
Initializing Views...
Starting benchmarking...
KokkosP: sample 5 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP:     set
KokkosP: sample 5 finished with child-begin function.
KokkosP: sample 5 calling child-end function...
KokkosP: Execution of kernel 2 is completed.

....

KokkosP: sample 96 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 47
KokkosP:     copy
KokkosP: sample 96 finished with child-begin function.
KokkosP: sample 96 calling child-end function...
KokkosP: Execution of kernel 47 is completed.
KokkosP: sample 98 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 48
KokkosP:     add
KokkosP: sample 98 finished with child-begin function.
KokkosP: sample 98 calling child-end function...
KokkosP: Execution of kernel 48 is completed.
KokkosP: sample 99 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 49
KokkosP:     triad
KokkosP: sample 99 finished with child-begin function.
KokkosP: sample 99 calling child-end function...
KokkosP: Execution of kernel 49 is completed.
KokkosP: sample 101 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 50
KokkosP:     copy
KokkosP: sample 101 finished with child-begin function.
KokkosP: sample 101 calling child-end function...
KokkosP: Execution of kernel 50 is completed.
KokkosP: sample 103 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 51
KokkosP:     add
KokkosP: sample 103 finished with child-begin function.
KokkosP: sample 103 calling child-end function...
KokkosP: Execution of kernel 51 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                12432.48 MB/s
Copy               15896.57 MB/s
Scale              15927.12 MB/s
Add                17134.47 MB/s
Triad              17060.44 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
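
The reproducibility seen in the two runs above follows from seeding the generator with a fixed value. The sketch below illustrates the idea only; the function name sampled_invocations and the choice of std::mt19937_64 are assumptions, so the indices it produces will not match the sample numbers in the logs.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Returns the 1-based indices of the invocations that would be sampled when
// each invocation is kept independently with the given probability.
std::vector<std::uint64_t> sampled_invocations(std::uint64_t seed,
                                               double probability_percent,
                                               std::uint64_t total_invocations) {
  assert(probability_percent >= 0.0 && probability_percent <= 100.0);
  std::mt19937_64 rng(seed);  // fixed seed: the same draws on every run
  std::uniform_real_distribution<double> percent(0.0, 100.0);
  std::vector<std::uint64_t> picked;
  for (std::uint64_t i = 1; i <= total_invocations; ++i) {
    if (percent(rng) < probability_percent) picked.push_back(i);
  }
  return picked;
}
```

Calling sampled_invocations(4, 50.0, 103) twice returns the same vector both times, which is the property the two runs demonstrate. Seeding from the clock instead gives a different set on each run.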

vlkale avatar Oct 23 '23 23:10 vlkale

Lets add a user option to set the seed, and don't delete the erase

Done.

vlkale avatar Oct 25 '23 14:10 vlkale

As an additional example run of the Kokkos Tools Sampler with randomized sampling, below are the build line and the subsequent output with the sampler utility applied to the kernel logger tool connector library, run with the stream.cuda executable on Perlmutter for 200 outer iterations, i.e., 200 timesteps. Sampling is done at 1.2% probability, i.e., for each invocation of any Kokkos kernel in the program, there is a 1.2% chance that the kernel will be logged to the screen.

The output below is reproducible on Perlmutter by using module load PrgEnv-gnu, building with the Kokkos develop branch and the Kokkos Serial+CUDA backend (the build line from this Kokkos Tools PR is shown below, before the output), and setting the KOKKOS_TOOLS_RANDOM_SEED variable to 2 (setting this variable to another number gives a different set of sampled invocations).

Build


vkale3@perlmutter:login29:~/kto-dev-vlk/nvtxbld> ccmake .. -DCMAKE_INSTALL_PREFIX="/global/u2/v/vkale3/ktins20240222-2" -DKokkos_COMPILE_LAUNCHER=/global/u2/v/vkale3/kks/bin/kokkos_launch_compiler -DKokkos_DIR=/global/u2/v/vkale3/kks/kbuild-cuda -DKokkos_NVCC_WRAPPER="/global/u2/v/vkale3/kks/bin/nvcc_wrapper" 

Output

vkale3@perlmutter:login13:~/kks/benchmarks/stream> export KOKKOS_TOOLS_LIBS="/global/u2/v/vkale3/ktins20240222-2/lib64/libkp_kokkos_sampler.so;/global/u2/v/vkale3/ktins20240222-2/lib64/libkp_kernel_logger.so"; export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_RANDOM_SEED=2; export KOKKOS_TOOLS_GLOBALFENCES=1; export KOKKOS_TOOLS_SAMPLER_PROB=1.2; /global/homes/v/vkale3/kks/benchmarks/stream/stream.cuda  
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/u2/v/vkale3/ktins20240222-2/lib64/libkp_kernel_logger.so
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       no
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 20
KokkosP: Sampling skip rate provided as input is: 20
KokkosP: Sampling probability provided as input is: 1.2
KokkosP: Sampling skip rate is set to: 21
KokkosP: Sampling probability is set to 1.200000
KokkosP: Seeding random number generator using seed 2 for random sampling.
KokkosP: You set both the probability and skip rate for the sampler. Only random sampling will be done, using the probabability you set; The skip rate you set will be ignored.
KokkosP: Note: The skip rate will be set to 1. Sampling will not be based  on a pre-defined periodicity.
Kokkos: Kokkos_Profiling.cpp:initialize: actions.fence = 0x10f6390 	 fenceFnPtr = (nil) 
Kokkos: Kokkos_Profiling.cpp:initialize: actions 0x7fff6d1179d0 	 fence address 0x10f6390
Kokkos: Kokkos_Profiling.cpp:initialize: tool_invoked_fence address 0x41a8f0 	 fenceFnPtr 0x41a8f0 
Kokkos: Kokkos_Profiling.cpp:initialize: after fence init actions 0x7fff6d1179d0 	 fence address 0x41a8f0
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    1000
- Per Array:             0.01 MB
- Total:                 0.02 MB
Benchmark kernels will be performed for 200 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 220 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 0
KokkosP:     scale
KokkosP: sample 220 finished with child-begin function.
KokkosP: sample 220 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 246 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 1
KokkosP:     add
KokkosP: sample 246 finished with child-begin function.
KokkosP: sample 246 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 1 is completed.
KokkosP: sample 304 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 2
KokkosP:     copy
KokkosP: sample 304 finished with child-begin function.
KokkosP: sample 304 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 2 is completed.
KokkosP: sample 403 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 3
KokkosP:     set
KokkosP: sample 403 finished with child-begin function.
KokkosP: sample 403 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 3 is completed.
KokkosP: sample 528 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 4
KokkosP:     set
KokkosP: sample 528 finished with child-begin function.
KokkosP: sample 528 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 4 is completed.
KokkosP: sample 625 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 5
KokkosP:     scale
KokkosP: sample 625 finished with child-begin function.
KokkosP: sample 625 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 5 is completed.
KokkosP: sample 642 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 6
KokkosP:     triad
KokkosP: sample 642 finished with child-begin function.
KokkosP: sample 642 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 6 is completed.
KokkosP: sample 737 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 7
KokkosP:     triad
KokkosP: sample 737 finished with child-begin function.
KokkosP: sample 737 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 7 is completed.
KokkosP: sample 804 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 8
KokkosP:     copy
KokkosP: sample 804 finished with child-begin function.
KokkosP: sample 804 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 8 is completed.
KokkosP: sample 849 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 9
KokkosP:     copy
KokkosP: sample 849 finished with child-begin function.
KokkosP: sample 849 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 9 is completed.
KokkosP: sample 927 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 10
KokkosP:     triad
KokkosP: sample 927 finished with child-begin function.
KokkosP: sample 927 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 10 is completed.
KokkosP: sample 957 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 11
KokkosP:     triad
KokkosP: sample 957 finished with child-begin function.
KokkosP: sample 957 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 11 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                  956.25 MB/s
Copy                1892.15 MB/s
Scale               1852.71 MB/s
Add                 2750.40 MB/s
Triad               2808.33 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.

vlkale avatar Feb 22 '24 19:02 vlkale

Lets add a user option to set the seed, and don't delete the erase

This is put in via KOKKOS_TOOLS_RANDOM_SEED.
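
For readers who want to see what "user-set seed, otherwise clock" can look like, here is a hedged sketch. The helper names parse_seed and resolve_seed are hypothetical and this is not the parsing code from the PR; only the variable name KOKKOS_TOOLS_RANDOM_SEED comes from this thread.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <cstdlib>

// Parses a non-negative decimal seed. Returns false for null, empty,
// signed, non-numeric, or out-of-range input.
bool parse_seed(const char* raw, std::uint64_t& seed_out) {
  if (raw == nullptr) return false;
  // strtoull would accept leading whitespace and signs; require a digit first.
  if (!std::isdigit(static_cast<unsigned char>(raw[0]))) return false;
  errno = 0;
  char* end = nullptr;
  const unsigned long long value = std::strtoull(raw, &end, 10);
  if (errno != 0 || end == raw || *end != '\0') return false;
  seed_out = static_cast<std::uint64_t>(value);
  return true;
}

// Uses the seed from the named environment variable if it parses,
// otherwise falls back to the current time.
std::uint64_t resolve_seed(const char* variable_name) {
  assert(variable_name != nullptr);
  std::uint64_t seed = 0;
  if (parse_seed(std::getenv(variable_name), seed)) return seed;
  return static_cast<std::uint64_t>(
      std::chrono::system_clock::now().time_since_epoch().count());
}
```

A call such as resolve_seed("KOKKOS_TOOLS_RANDOM_SEED") returns 2 when the variable is exported as 2, and a time-derived value when it is unset or malformed.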

vlkale avatar Feb 22 '24 19:02 vlkale

This PR now has tests for the probability sampling and is rebased with develop.

vlkale avatar Apr 12 '24 00:04 vlkale