Use probability sampling over periodic sampling
Fixes #180.
This PR is related to the older PR #181.
The current Kokkos sampler utility uses periodic sampling via a sampler skip rate. This is often restrictive when sampling profiling and debugging data: for example, it can miss important data that does not align with the periodicity of kernel invocations. The goal of this PR is to allow users of the Kokkos sampler utility to use random sampling primarily and periodic sampling secondarily, via environment variables of the form KOKKOS_TOOLS_SAMPLER_xyz.
Since the solution should not allow a combination of periodicity and probability, the probability is always chosen when both are set, and the skip rate is ignored.
For example, suppose a user requests sampling of every 20th invocation of a Kokkos::parallel_for() and also requests gathering the time spent in that Kokkos::parallel_for() with probability 63% on each invocation. Then the sampler will not skip any invocations of the Kokkos::parallel_for() based on periodicity; instead, it will obtain a timing with probability 63% on each invocation.
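The precedence rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `SamplerConfig` struct, the `should_sample` function, and the convention that a negative probability means "not set" are all hypothetical names and assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical configuration mirroring the two sampler settings.
struct SamplerConfig {
  std::uint64_t skip_rate = 1;  // periodic: sample every skip_rate-th invocation
  double probability = -1.0;    // percent in [0, 100]; negative means "not set" (assumption)
};

// Decide whether the given (1-based) invocation should be forwarded to the
// child tool. When a probability is set, it takes precedence and the skip
// rate is ignored; otherwise the periodic rule applies.
bool should_sample(const SamplerConfig& cfg, std::uint64_t invocation,
                   std::mt19937_64& rng) {
  assert(cfg.skip_rate > 0);
  if (cfg.probability >= 0.0) {
    // Draw a value in [0, 100) and compare it against the requested percentage.
    std::uniform_real_distribution<double> percent(0.0, 100.0);
    return percent(rng) < cfg.probability;
  }
  return invocation % cfg.skip_rate == 0;
}
```

With `skip_rate = 20` and `probability = 63.0`, every invocation is a candidate and each one is sampled independently with a 63% chance; with the probability left unset, only invocations 20, 40, 60, and so on are sampled.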
See common/kokkos-sampler/README.md for a high-level overview, in plain English, of the changes.
For later: I will add a slide on sampling and filtering to the Kokkos Tools tutorial to explain how to use these utilities.
The following two outputs show that sampling with probability 1.0% works properly when applied to the kernel timer Kokkos tool for the stream benchmark in the Kokkos core benchmark folder.
The first output is with Kokkos Tools global fences on (tool-induced fencing is enabled), and the second is with global fencing turned off. In the second case, fencing is not invoked, as expected. Also note that the set of Kokkos kernel invocation numbers sampled differs between the two runs: the random number generator is seeded with the current time, so different invocations are sampled in each run. The following was run on macOS with gcc and Kokkos 4.1.
vlkale@s1088602ca stream % export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_SAMPLER_PROB=1.0; export KOKKOS_TOOLS_GLOBALFENCES=1; export KOKKOS_TOOLS_LIBS="/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/common/kokkos-sampler/kp_sampler.so;/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so"; ./stream.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so
KokkosP: Loading child library ..
KokkosP: Simple Kernel Timer Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for: yes
KokkosP: begin-parallel-scan: yes
KokkosP: begin-parallel-reduce: yes
KokkosP: end-parallel-for: yes
KokkosP: end-parallel-scan: yes
KokkosP: end-parallel-reduce: yes
KokkosP: Sampling rate set to: 20
KokkosP: Sampling rate provided as input: 20
KokkosP: Sampling probability provided as input: 1.0
KokkosP: Sampling rate set to: 21
KokkosP: Sampling probability set to 1.000000
KokkosP: seeding Random Number Generator using clock for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size: 100000000
- Per Array: 800.00 MB
- Total: 2400.00 MB
Benchmark kernels will be performed for 200 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 12 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 12 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 267 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 267 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 362 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 362 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 468 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 468 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 503 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 503 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 579 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 579 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 657 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 657 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 925 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
KokkosP: sample 925 calling child-end function...
KokkosP: Sampler utility sucessfully invoked tool-induced fence on device 0
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set 62629.87 MB/s
Copy 74669.24 MB/s
Scale 74154.04 MB/s
Add 83099.73 MB/s
Triad 82674.88 MB/s
-------------------------------------------------------------
KokkosP: Kernel timing written to /Users/vlkale/Desktop/vlap/wk/code/softwareTech/kokkos/benchmarks/stream/s1088602ca-41172.dat
vlkale@s1088602ca stream % export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_SAMPLER_PROB=1.0; export KOKKOS_TOOLS_GLOBALFENCES=0; export KOKKOS_TOOLS_LIBS="/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/common/kokkos-sampler/kp_sampler.so;/Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so"; ./stream.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/vlkale/Desktop/vlap/wk/code/softwareTech/ktools/ktov105/profiling/simple-kernel-timer/kp_kernel_timer.so
KokkosP: Loading child library ..
KokkosP: Simple Kernel Timer Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for: yes
KokkosP: begin-parallel-scan: yes
KokkosP: begin-parallel-reduce: yes
KokkosP: end-parallel-for: yes
KokkosP: end-parallel-scan: yes
KokkosP: end-parallel-reduce: yes
KokkosP: Sampling rate set to: 20
KokkosP: Sampling rate provided as input: 20
KokkosP: Sampling probability provided as input: 1.0
KokkosP: Sampling rate set to: 21
KokkosP: Sampling probability set to 1.000000
KokkosP: seeding Random Number Generator using clock for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size: 100000000
- Per Array: 800.00 MB
- Total: 2400.00 MB
Benchmark kernels will be performed for 200 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 28 calling child-begin function...
KokkosP: sample 28 calling child-end function...
KokkosP: sample 296 calling child-begin function...
KokkosP: sample 296 calling child-end function...
KokkosP: sample 370 calling child-begin function...
KokkosP: sample 370 calling child-end function...
KokkosP: sample 377 calling child-begin function...
KokkosP: sample 377 calling child-end function...
KokkosP: sample 476 calling child-begin function...
KokkosP: sample 476 calling child-end function...
KokkosP: sample 503 calling child-begin function...
KokkosP: sample 503 calling child-end function...
KokkosP: sample 601 calling child-begin function...
KokkosP: sample 601 calling child-end function...
KokkosP: sample 693 calling child-begin function...
KokkosP: sample 693 calling child-end function...
KokkosP: sample 944 calling child-begin function...
KokkosP: sample 944 calling child-end function...
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set 62633.34 MB/s
Copy 74470.13 MB/s
Scale 74797.96 MB/s
Add 83071.21 MB/s
Triad 82997.36 MB/s
-------------------------------------------------------------
KokkosP: Kernel timing written to /Users/vlkale/Desktop/vlap/wk/code/softwareTech/kokkos/benchmarks/stream/s1088602ca-41194.dat
vlkale@s1088602ca stream %
Below is a test of the most recently committed version with KOKKOS_TOOLS_SEED set, as requested by @crtrott. Two separate runs of the stream program were done with the following environment variables set. As the output shows, both runs sample the same sequence of events, which indicates that the manual seed, rather than the time-generated seed, is working for this case. Note that the output is truncated for easier viewing within this GitHub issue.
ViveksMacBook: stream % export KOKKOS_TOOLS_LIBS="/Users/vivek/kto-inst/libkp_kokkos_sampler.dylib;/Users/Vivek/kto-inst/libkp_kernel_logger.dylib"; export KOKKOS_TOOLS_SEED=4; export KOKKOS_TOOLS_SEED=4; export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_SAMPLER_PROBABILITY=50.0; ./stream.exe;
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/vivek/kto-inst/libkp_kernel_logger.dylib
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for: yes
KokkosP: begin-parallel-scan: yes
KokkosP: begin-parallel-reduce: yes
KokkosP: end-parallel-for: yes
KokkosP: end-parallel-scan: no
KokkosP: end-parallel-reduce: yes
KokkosP: Sampling rate set to: 1
KokkosP: Sampling rate provided as input: 1
KokkosP: Sampling probability provided as input: 50.0
KokkosP: Sampling rate set to: 2
KokkosP: Sampling probability set to 50.000000
KokkosP: Seeding random number generator using seed 4 for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size: 100000000
- Per Array: 800.00 MB
- Total: 2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
KokkosP: sample 1 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP: Kokkos::View::initialization [a] via memset
KokkosP: sample 1 finished with child-begin function.
KokkosP: sample 1 calling child-end function...
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 3 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP: Kokkos::View::initialization [c] via memset
KokkosP: sample 3 finished with child-begin function.
KokkosP: sample 3 calling child-end function...
KokkosP: Execution of kernel 1 is completed.
Initializing Views...
Starting benchmarking...
KokkosP: sample 5 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP: set
KokkosP: sample 5 finished with child-begin function.
KokkosP: sample 5 calling child-end function...
KokkosP: Execution of kernel 2 is completed.
...
KokkosP: sample 96 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 47
KokkosP: copy
KokkosP: sample 96 finished with child-begin function.
KokkosP: sample 96 calling child-end function...
KokkosP: Execution of kernel 47 is completed.
KokkosP: sample 98 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 48
KokkosP: add
KokkosP: sample 98 finished with child-begin function.
KokkosP: sample 98 calling child-end function...
KokkosP: Execution of kernel 48 is completed.
KokkosP: sample 99 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 49
KokkosP: triad
KokkosP: sample 99 finished with child-begin function.
KokkosP: sample 99 calling child-end function...
KokkosP: Execution of kernel 49 is completed.
KokkosP: sample 101 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 50
KokkosP: copy
KokkosP: sample 101 finished with child-begin function.
KokkosP: sample 101 calling child-end function...
KokkosP: Execution of kernel 50 is completed.
KokkosP: sample 103 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 51
KokkosP: add
KokkosP: sample 103 finished with child-begin function.
KokkosP: sample 103 calling child-end function...
KokkosP: Execution of kernel 51 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set 12593.03 MB/s
Copy 16127.57 MB/s
Scale 15996.91 MB/s
Add 16634.53 MB/s
Triad 17182.65 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
ViveksMacBook stream % ./stream.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /Users/Vivek/kto-inst/libkp_kernel_logger.dylib
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for: yes
KokkosP: begin-parallel-scan: yes
KokkosP: begin-parallel-reduce: yes
KokkosP: end-parallel-for: yes
KokkosP: end-parallel-scan: no
KokkosP: end-parallel-reduce: yes
KokkosP: Sampling rate set to: 1
KokkosP: Sampling rate provided as input: 1
KokkosP: Sampling probability provided as input: 50.0
KokkosP: Sampling rate set to: 2
KokkosP: Sampling probability set to 50.000000
KokkosP: Seeding random number generator using seed 4 for probabilistic sampling.
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with only specified sampling probability applied and sampling skip rate set is ignored with no predefined periodicity for sampling used.
KokkosP: The skip rate in the sampler utility is being set to 1.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size: 100000000
- Per Array: 800.00 MB
- Total: 2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
KokkosP: sample 1 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP: Kokkos::View::initialization [a] via memset
KokkosP: sample 1 finished with child-begin function.
KokkosP: sample 1 calling child-end function...
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 3 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP: Kokkos::View::initialization [c] via memset
KokkosP: sample 3 finished with child-begin function.
KokkosP: sample 3 calling child-end function...
KokkosP: Execution of kernel 1 is completed.
Initializing Views...
Starting benchmarking...
KokkosP: sample 5 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP: set
KokkosP: sample 5 finished with child-begin function.
KokkosP: sample 5 calling child-end function...
KokkosP: Execution of kernel 2 is completed.
....
KokkosP: sample 96 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 47
KokkosP: copy
KokkosP: sample 96 finished with child-begin function.
KokkosP: sample 96 calling child-end function...
KokkosP: Execution of kernel 47 is completed.
KokkosP: sample 98 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 48
KokkosP: add
KokkosP: sample 98 finished with child-begin function.
KokkosP: sample 98 calling child-end function...
KokkosP: Execution of kernel 48 is completed.
KokkosP: sample 99 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 49
KokkosP: triad
KokkosP: sample 99 finished with child-begin function.
KokkosP: sample 99 calling child-end function...
KokkosP: Execution of kernel 49 is completed.
KokkosP: sample 101 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 50
KokkosP: copy
KokkosP: sample 101 finished with child-begin function.
KokkosP: sample 101 calling child-end function...
KokkosP: Execution of kernel 50 is completed.
KokkosP: sample 103 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 51
KokkosP: add
KokkosP: sample 103 finished with child-begin function.
KokkosP: sample 103 calling child-end function...
KokkosP: Execution of kernel 51 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set 12432.48 MB/s
Copy 15896.57 MB/s
Scale 15927.12 MB/s
Add 17134.47 MB/s
Triad 17060.44 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
Let's add a user option to set the seed, and don't delete the erase
Done.
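The seeding behavior shown in the runs above (clock-based by default, fixed when the user provides a seed) can be sketched as below. This is an illustrative sketch only: `choose_seed` and `sampled_invocations` are hypothetical helper names, and the choice of `std::mt19937_64` is an assumption, not necessarily the generator used in this PR.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <random>
#include <vector>

// Hypothetical helper: use the user-provided seed string when present,
// otherwise fall back to the current time.
std::uint64_t choose_seed(const char* env_value) {
  if (env_value != nullptr && *env_value != '\0') {
    return std::strtoull(env_value, nullptr, 10);
  }
  return static_cast<std::uint64_t>(
      std::chrono::system_clock::now().time_since_epoch().count());
}

// Return the 1-based invocation numbers that would be sampled out of
// n_invocations for a given seed and sampling probability (in percent).
std::vector<std::uint64_t> sampled_invocations(std::uint64_t seed,
                                               double probability_percent,
                                               std::uint64_t n_invocations) {
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> percent(0.0, 100.0);
  std::vector<std::uint64_t> sampled;
  for (std::uint64_t i = 1; i <= n_invocations; ++i) {
    if (percent(rng) < probability_percent) sampled.push_back(i);
  }
  return sampled;
}
```

A caller could obtain the seed with `choose_seed(std::getenv("KOKKOS_TOOLS_SEED"))`. Two calls to `sampled_invocations` with the same seed return the same list, which is the property the two seeded runs above demonstrate; with a clock-based seed, the lists generally differ between runs.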
As an additional example run of the Kokkos Tools Sampler with randomized sampling, below are the build line and the subsequent output with the sampler utility applied to the kernel logger tool connector library, run with the stream.cuda executable on Perlmutter for 200 outer iterations, i.e., 200 timesteps. Sampling is done at 1.2% probability, i.e., for each invocation of any Kokkos kernel in the program, there is a 1.2% chance that the kernel will be logged to the screen.
The output below is reproducible on Perlmutter by running module load PrgEnv-gnu, building with the Kokkos develop branch using the Kokkos Serial+CUDA backend (the build line for this Kokkos Tools PR is shown below, before the output), and setting the KOKKOS_TOOLS_RANDOM_SEED variable to 2. If you set this variable to another number, you will get a different set of sampled invocations.
Build
vkale3@perlmutter:login29:~/kto-dev-vlk/nvtxbld> ccmake .. -DCMAKE_INSTALL_PREFIX="/global/u2/v/vkale3/ktins20240222-2" -DKokkos_COMPILE_LAUNCHER=/global/u2/v/vkale3/kks/bin/kokkos_launch_compiler -DKokkos_DIR=/global/u2/v/vkale3/kks/kbuild-cuda -DKokkos_NVCC_WRAPPER="/global/u2/v/vkale3/kks/bin/nvcc_wrapper"
Output
vkale3@perlmutter:login13:~/kks/benchmarks/stream> export KOKKOS_TOOLS_LIBS="/global/u2/v/vkale3/ktins20240222-2/lib64/libkp_kokkos_sampler.so;/global/u2/v/vkale3/ktins20240222-2/lib64/libkp_kernel_logger.so"; export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_RANDOM_SEED=2; export KOKKOS_TOOLS_GLOBALFENCES=1; export KOKKOS_TOOLS_SAMPLER_PROB=1.2; /global/homes/v/vkale3/kks/benchmarks/stream/stream.cuda
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/u2/v/vkale3/ktins20240222-2/lib64/libkp_kernel_logger.so
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for: yes
KokkosP: begin-parallel-scan: yes
KokkosP: begin-parallel-reduce: yes
KokkosP: end-parallel-for: yes
KokkosP: end-parallel-scan: no
KokkosP: end-parallel-reduce: yes
KokkosP: Sampling rate set to: 20
KokkosP: Sampling skip rate provided as input is: 20
KokkosP: Sampling probability provided as input is: 1.2
KokkosP: Sampling skip rate is set to: 21
KokkosP: Sampling probability is set to 1.200000
KokkosP: Seeding random number generator using seed 2 for random sampling.
KokkosP: You set both the probability and skip rate for the sampler. Only random sampling will be done, using the probabability you set; The skip rate you set will be ignored.
KokkosP: Note: The skip rate will be set to 1. Sampling will not be based on a pre-defined periodicity.
Kokkos: Kokkos_Profiling.cpp:initialize: actions.fence = 0x10f6390 fenceFnPtr = (nil)
Kokkos: Kokkos_Profiling.cpp:initialize: actions 0x7fff6d1179d0 fence address 0x10f6390
Kokkos: Kokkos_Profiling.cpp:initialize: tool_invoked_fence address 0x41a8f0 fenceFnPtr 0x41a8f0
Kokkos: Kokkos_Profiling.cpp:initialize: after fence init actions 0x7fff6d1179d0 fence address 0x41a8f0
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size: 1000
- Per Array: 0.01 MB
- Total: 0.02 MB
Benchmark kernels will be performed for 200 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 220 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 0
KokkosP: scale
KokkosP: sample 220 finished with child-begin function.
KokkosP: sample 220 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 246 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 1
KokkosP: add
KokkosP: sample 246 finished with child-begin function.
KokkosP: sample 246 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 1 is completed.
KokkosP: sample 304 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 2
KokkosP: copy
KokkosP: sample 304 finished with child-begin function.
KokkosP: sample 304 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 2 is completed.
KokkosP: sample 403 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 3
KokkosP: set
KokkosP: sample 403 finished with child-begin function.
KokkosP: sample 403 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 3 is completed.
KokkosP: sample 528 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 4
KokkosP: set
KokkosP: sample 528 finished with child-begin function.
KokkosP: sample 528 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 4 is completed.
KokkosP: sample 625 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 5
KokkosP: scale
KokkosP: sample 625 finished with child-begin function.
KokkosP: sample 625 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 5 is completed.
KokkosP: sample 642 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 6
KokkosP: triad
KokkosP: sample 642 finished with child-begin function.
KokkosP: sample 642 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 6 is completed.
KokkosP: sample 737 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 7
KokkosP: triad
KokkosP: sample 737 finished with child-begin function.
KokkosP: sample 737 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 7 is completed.
KokkosP: sample 804 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 8
KokkosP: copy
KokkosP: sample 804 finished with child-begin function.
KokkosP: sample 804 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 8 is completed.
KokkosP: sample 849 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 9
KokkosP: copy
KokkosP: sample 849 finished with child-begin function.
KokkosP: sample 849 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 9 is completed.
KokkosP: sample 927 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 10
KokkosP: triad
KokkosP: sample 927 finished with child-begin function.
KokkosP: sample 927 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 10 is completed.
KokkosP: sample 957 calling child-begin function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 11
KokkosP: triad
KokkosP: sample 957 finished with child-begin function.
KokkosP: sample 957 calling child-end function...
KokkosP: Sampler attempting to invoke tool-induced fence on device 0.
KokkosP: Sampler sucessfully invoked tool-induced fence on device 0
KokkosP: Execution of kernel 11 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set 956.25 MB/s
Copy 1892.15 MB/s
Scale 1852.71 MB/s
Add 2750.40 MB/s
Triad 2808.33 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
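As a rough sanity check on the output above, the expected number of samples can be estimated. The count of N = 1000 kernel invocations is an assumption (5 benchmark kernels times 200 iterations, ignoring any kernels launched during initialization), and p is the 1.2% sampling probability:

```latex
\mathbb{E}[\text{samples}] = N p = 1000 \times 0.012 = 12,
\qquad
\sigma = \sqrt{N p (1 - p)} = \sqrt{1000 \times 0.012 \times 0.988} \approx 3.4
```

The run above logs 12 samples (invocations 220 through 957), which is consistent with this estimate.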
Let's add a user option to set the seed, and don't delete the erase
This is put in via KOKKOS_TOOLS_RANDOM_SEED.
This PR now has tests for the probability sampling and is rebased on develop.