- Enable counter-based events for regular command lists.
- Counter-based events may be reused even though they are not done.
- When an event's ref count drops to the value that means it is not used by external clients, the event may be reused by subsequent calls.
- Move events that are no longer externally visible to a reusable pool and reuse them more aggressively.

intel/llvm PR: https://github.com/intel/llvm/pull/14754
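The reuse rule in the description can be illustrated with a small model. This is only a sketch of the idea, not the Level Zero adapter's code: `Event`, `EventPool`, and every method name here are hypothetical, and the real adapter tracks much more state.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ident: int
    counter_based: bool
    external_refs: int = 1   # references held by clients outside the runtime
    completed: bool = False


class EventPool:
    """Toy pool: recycle events once no external client can see them."""

    def __init__(self):
        self._next_id = 0
        self._reusable = []

    def acquire(self, counter_based):
        # Prefer a recycled event of the matching kind.
        for i, candidate in enumerate(self._reusable):
            if candidate.counter_based == counter_based:
                ev = self._reusable.pop(i)
                ev.external_refs = 1
                ev.completed = False
                return ev
        ev = Event(self._next_id, counter_based)
        self._next_id += 1
        return ev

    def release(self, ev):
        # An external client dropped its reference.
        ev.external_refs -= 1
        self._maybe_recycle(ev)

    def complete(self, ev):
        # The device signalled the event.
        ev.completed = True
        self._maybe_recycle(ev)

    def _maybe_recycle(self, ev):
        if ev.external_refs > 0 or any(e is ev for e in self._reusable):
            return
        # Counter-based events may be reused even if not done;
        # in this model, regular events must have completed first.
        if ev.counter_based or ev.completed:
            self._reusable.append(ev)
```

In this model a counter-based event returns to the pool as soon as its last external reference is dropped, while a regular event waits for completion.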

May 31 '24 23:05 winstonzhang-intel

This does not compile with the L0 adapter enabled. Also, feel free to add a relevant benchmark scenario to https://github.com/oneapi-src/unified-runtime/blob/main/.github/scripts/compute_benchmarks.py, or just run the existing benchmark with whatever env variables are needed. You can run these from: https://github.com/oneapi-src/unified-runtime/actions/workflows/benchmarks_compute.yml

You can reach out to me if you need help or advice.

Jun 06 '24 15:06 pbalcer

@pbalcer It should compile now. I'm still working through some of the e2e tests that are failing.

Jun 10 '24 19:06 winstonzhang-intel

@winstonzhang-intel , please link the intel/llvm PR related to this issue so we can see the full e2e test results.

Jun 14 '24 14:06 nrspruit

Compute Benchmarks level_zero run (with params: --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/9694638615

Jun 27 '24 10:06 github-actions[bot]

Compute Benchmarks level_zero run (with params: --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/9694638615 Job status: failure. Test status: skipped.

Jun 27 '24 10:06 github-actions[bot]

Compute Benchmarks level_zero run (with params: --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/9780598178

Jul 03 '24 15:07 github-actions[bot]

Compute Benchmarks level_zero run (with params: --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/9780598178 Job status: success. Test status: success.

Benchmark Results


---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl, mean execution time per 10 kernels (μs)
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=1<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)<br>Imm-CmdLists-OFF

        This PR (38.675 us)   : crit, 0, 38

        baseline (38.357 us)   :  0, 38

    -   : 0, 0

    -   : 0, 0

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=0<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)<br>Imm-CmdLists-OFF

        This PR (36.082 us)   : crit, 0, 36

        baseline (36.972 us)   :  0, 36

    -   : 0, 0

    -   : 0, 0

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=1<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)<br>

        This PR (40.549 us)   : crit, 0, 40

        baseline (41.505 us)   :  0, 41

    -   : 0, 0

    -   : 0, 0

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=0<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)<br>

        This PR (40.023 us)   : crit, 0, 40

        baseline (41.129 us)   :  0, 41

    -   : 0, 0

    -   : 0, 0

Details

SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0) Imm-CmdLists-OFF

Environment Variables:

UR_L0_USE_IMMEDIATE_COMMANDLISTS=0 UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/actions-runner/_work/unified-runtime/unified-runtime/compute-benchmarks-build/bin//api_overhead_benchmark_sycl --test=SubmitKernel --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=10000 --Profiling=0 --NumKernels=10 --KernelExecTime=1 --csv --noHeaders

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),38.675,38.403,4.91%,37.600,206.755,[CPU],[us]

SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0) Imm-CmdLists-OFF

Environment Variables:

UR_L0_USE_IMMEDIATE_COMMANDLISTS=0 UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/actions-runner/_work/unified-runtime/unified-runtime/compute-benchmarks-build/bin//api_overhead_benchmark_sycl --test=SubmitKernel --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=10000 --Profiling=0 --NumKernels=10 --KernelExecTime=1 --csv --noHeaders

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),36.082,36.040,2.38%,35.332,112.299,[CPU],[us]

SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

UR_L0_USE_IMMEDIATE_COMMANDLISTS=1 UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/actions-runner/_work/unified-runtime/unified-runtime/compute-benchmarks-build/bin//api_overhead_benchmark_sycl --test=SubmitKernel --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=10000 --Profiling=0 --NumKernels=10 --KernelExecTime=1 --csv --noHeaders

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),40.549,40.484,2.12%,39.520,109.681,[CPU],[us]

SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

UR_L0_USE_IMMEDIATE_COMMANDLISTS=1 UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/actions-runner/_work/unified-runtime/unified-runtime/compute-benchmarks-build/bin//api_overhead_benchmark_sycl --test=SubmitKernel --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=10000 --Profiling=0 --NumKernels=10 --KernelExecTime=1 --csv --noHeaders

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),40.023,39.999,2.41%,38.600,109.795,[CPU],[us]
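For anyone post-processing these outputs, here is a rough sketch of parsing one result row. It assumes the test name contains no commas and that each row carries one more field than the header names (the trailing unit, such as `[us]`), as in the rows above; `parse_result_row` is a hypothetical helper, not part of the benchmark scripts.

```python
def parse_result_row(row: str) -> dict:
    """Split one CSV result row into named fields.

    The last seven comma-separated fields are numeric columns, the
    measurement type, and the unit; everything before them is the name.
    """
    name, mean, median, stddev, lo, hi, kind, unit = row.rsplit(",", 7)
    return {
        "test": name,
        "mean": float(mean),
        "median": float(median),
        "stddev_pct": float(stddev.rstrip("%")),
        "min": float(lo),
        "max": float(hi),
        "type": kind.strip("[]"),
        "unit": unit.strip("[]"),
    }


row = (
    "SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 "
    "KernelExecTime=1 MeasureCompletion=0),38.675,38.403,4.91%,37.600,206.755,[CPU],[us]"
)
result = parse_result_row(row)
print(result["mean"], result["unit"])
```

Keeping the maximum next to the mean is useful here, since several rows show a maximum far above the median.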

Jul 03 '24 15:07 github-actions[bot]

Compute Benchmarks level_zero run (with params: --compare baseline --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/10055105565

Jul 23 '24 08:07 github-actions[bot]

Compute Benchmarks level_zero run (--compare baseline --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/10055105565 Job status: failure. Test status: failure.

Jul 23 '24 08:07 github-actions[bot]

easyWave_sycl -grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt -time 120

The easyWave_sycl benchmark hung with this PR.

Jul 23 '24 08:07 pbalcer

> easyWave_sycl -grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt -time 120
> The easyWave_sycl benchmark hung with this PR.

@pbalcer, how can one get this benchmark and run it locally? That way @winstonzhang-intel can investigate the issue.

Jul 23 '24 23:07 nrspruit

@pbalcer I'm getting different results on llvm/sycl test-e2e, which I also confirmed locally on a PVC machine. The following tests pass on my machine:

SYCL :: DiscardEvents/discard_events_mixed_calls.cpp
SYCL :: ESIMD/BitonicSortKv2.cpp
SYCL :: ESIMD/kmeans/kmeans.cpp
SYCL :: Graph/RecordReplay/barrier_multi_queue.cpp
SYCL :: Graph/RecordReplay/dotp_in_order.cpp
SYCL :: Graph/RecordReplay/dotp_in_order_pause.cpp
SYCL :: Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp
SYCL :: Graph/RecordReplay/dotp_multiple_queues.cpp
SYCL :: Graph/RecordReplay/host_task_in_order.cpp
SYCL :: Graph/RecordReplay/sub_graph_in_order.cpp
SYCL :: Graph/RecordReplay/usm_copy_in_order.cpp

An example output of one of the tests:
$ LD_LIBRARY_PATH=/iusers/winstonz/lib/driver/:/iusers/winstonz/llvm/build/lib:$LD_LIBRARY_PATH UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 ./build/bin/llvm-lit -vv sycl/test-e2e/Graph/RecordReplay/usm_copy_in_order.cpp
llvm-lit: /localdisk2/winstonz/llvm/sycl/test-e2e/lit.cfg.py:414: note: Targeted devices: all
llvm-lit: /localdisk2/winstonz/llvm/sycl/test-e2e/lit.cfg.py:635: warning: Couldn't find pre-installed AOT device compiler ocloc
llvm-lit: /localdisk2/winstonz/llvm/sycl/test-e2e/lit.cfg.py:635: warning: Couldn't find pre-installed AOT device compiler opencl-aot
llvm-lit: /localdisk2/winstonz/llvm/sycl/test-e2e/lit.cfg.py:733: note: Aspects for level_zero:gpu: ext_oneapi_fixed_size_group, gpu, queue_profiling, ext_oneapi_bindless_images_shared_usm, ext_intel_device_id, usm_atomic_shared_allocations, ext_intel_gpu_subslices_per_slice, ext_oneapi_private_alloca, ext_intel_gpu_eu_simd_width, usm_device_allocations, ext_oneapi_bindless_images_2d_usm, ext_oneapi_graph, ext_oneapi_queue_profiling_tag, ext_oneapi_bindless_images, fp16, ext_intel_gpu_hw_threads_per_eu, online_linker, ext_oneapi_tangle_group, online_compiler, usm_host_allocations, ext_intel_memory_bus_width, ext_intel_gpu_eu_count_per_subslice, fp64, ext_intel_memory_clock_rate, ext_intel_gpu_eu_count, ext_oneapi_mipmap_anisotropy, ext_intel_device_info_uuid, ext_intel_matrix, ext_oneapi_opportunistic_group, ext_intel_pci_address, ext_oneapi_mipmap, ext_oneapi_ballot_group, ext_intel_esimd, atomic64, usm_shared_allocations, ext_oneapi_virtual_mem, ext_intel_gpu_slices, ext_oneapi_limited_graph
llvm-lit: /localdisk2/winstonz/llvm/sycl/test-e2e/lit.cfg.py:745: note: SG sizes for level_zero:gpu: 16, 32
llvm-lit: /localdisk2/winstonz/llvm/sycl/test-e2e/lit.cfg.py:754: note: Architectures for level_zero:gpu: intel_gpu_pvc
-- Testing: 1 tests, 1 workers --
PASS: SYCL :: Graph/RecordReplay/usm_copy_in_order.cpp (1 of 1)

Testing Time: 77.00s

Total Discovered Tests: 1 Passed: 1 (100.00%)

2 warning(s) in tests

Jul 24 '24 03:07 winstonzhang-intel

https://github.com/oneapi-src/Velocity-Bench/tree/main/easywave

You can also use our automation scripts: https://github.com/oneapi-src/unified-runtime/tree/main/scripts/benchmarks

There's no way to select a single benchmark yet, but for now you can comment out all the benchmarks except easywave: https://github.com/oneapi-src/unified-runtime/blob/main/scripts/benchmarks/main.py#L40

As for the failing E2E tests, please create a PR on intel/llvm if you feel the failures in UR CI are incorrect.

Jul 24 '24 09:07 pbalcer

> lgtm once all tests are green and the benchmarks are passing.
>
> Just curious, why not base this PR on #1600?

#1600 still has some tests that are not passing, so I didn't rebase against it. Here's the CI on llvm/sycl, which is all passing: https://github.com/intel/llvm/pull/14754

^None of the tests that URT CI claims to be failing are failing on llvm/sycl CI

Jul 24 '24 22:07 winstonzhang-intel

> ^None of the tests that URT CI claims to be failing are failing on llvm/sycl CI

They don't have a PVC system in CI. Other PRs (see this PR) do not exhibit the same failures as this one (ignoring the address sanitizer problem that popped up yesterday). These failures seem to be unique for this PR:

  SYCL :: DiscardEvents/discard_events_mixed_calls.cpp
  SYCL :: ESIMD/BitonicSortKv2.cpp
  SYCL :: ESIMD/kmeans/kmeans.cpp
  SYCL :: Graph/RecordReplay/barrier_multi_queue.cpp
  SYCL :: Graph/RecordReplay/dotp_in_order.cpp
  SYCL :: Graph/RecordReplay/dotp_in_order_pause.cpp
  SYCL :: Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp
  SYCL :: Graph/RecordReplay/dotp_multiple_queues.cpp
  SYCL :: Graph/RecordReplay/host_task_in_order.cpp
  SYCL :: Graph/RecordReplay/sub_graph_in_order.cpp
  SYCL :: Graph/RecordReplay/usm_copy_in_order.cpp

Jul 25 '24 06:07 pbalcer

Compute Benchmarks level_zero run (with params: ): https://github.com/oneapi-src/unified-runtime/actions/runs/10094246782

Jul 25 '24 12:07 github-actions[bot]

Compute Benchmarks level_zero run (): https://github.com/oneapi-src/unified-runtime/actions/runs/10094246782 Job status: failure. Test status: failure.

Jul 25 '24 12:07 github-actions[bot]

> > ^None of the tests that URT CI claims to be failing are failing on llvm/sycl CI
>
> They don't have a PVC system in CI. Other PRs (see this PR) do not exhibit the same failures as this one (ignoring the address sanitizer problem that popped up yesterday). These failures seem to be unique for this PR:
>   SYCL :: DiscardEvents/discard_events_mixed_calls.cpp
>   SYCL :: ESIMD/BitonicSortKv2.cpp
>   SYCL :: ESIMD/kmeans/kmeans.cpp
>   SYCL :: Graph/RecordReplay/barrier_multi_queue.cpp
>   SYCL :: Graph/RecordReplay/dotp_in_order.cpp
>   SYCL :: Graph/RecordReplay/dotp_in_order_pause.cpp
>   SYCL :: Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp
>   SYCL :: Graph/RecordReplay/dotp_multiple_queues.cpp
>   SYCL :: Graph/RecordReplay/host_task_in_order.cpp
>   SYCL :: Graph/RecordReplay/sub_graph_in_order.cpp
>   SYCL :: Graph/RecordReplay/usm_copy_in_order.cpp

I've tried at least 5 PVC machines now and none of them seems to be able to reproduce these failures.

Jul 26 '24 22:07 winstonzhang-intel

> > > ^None of the tests that URT CI claims to be failing are failing on llvm/sycl CI
> >
> > They don't have a PVC system in CI. Other PRs (see this PR) do not exhibit the same failures as this one (ignoring the address sanitizer problem that popped up yesterday). These failures seem to be unique for this PR:
> >   SYCL :: DiscardEvents/discard_events_mixed_calls.cpp
> >   SYCL :: ESIMD/BitonicSortKv2.cpp
> >   SYCL :: ESIMD/kmeans/kmeans.cpp
> >   SYCL :: Graph/RecordReplay/barrier_multi_queue.cpp
> >   SYCL :: Graph/RecordReplay/dotp_in_order.cpp
> >   SYCL :: Graph/RecordReplay/dotp_in_order_pause.cpp
> >   SYCL :: Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp
> >   SYCL :: Graph/RecordReplay/dotp_multiple_queues.cpp
> >   SYCL :: Graph/RecordReplay/host_task_in_order.cpp
> >   SYCL :: Graph/RecordReplay/sub_graph_in_order.cpp
> >   SYCL :: Graph/RecordReplay/usm_copy_in_order.cpp
>
> I've tried at least 5 PVC machines now and none of them seems to be able to reproduce these failures.

@winstonzhang-intel, PVC runs immediate command lists by default. This functionality is for regular command lists, so you need to test on a GEN12, DG2, or Flex GPU.
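As a rough sketch of setting up such a run from a script: the environment variable names below are the ones that appear in the benchmark reports in this thread (`UR_L0_USE_IMMEDIATE_COMMANDLISTS=0` is what the "Imm-CmdLists-OFF" runs use), and the `llvm-lit` path is the one from the log above. The helper names are hypothetical.

```python
import os


def regular_cmdlist_env(base=None):
    """Return a copy of the environment that asks for regular command lists
    with driver counter-based events enabled."""
    env = dict(os.environ if base is None else base)
    env["UR_L0_USE_IMMEDIATE_COMMANDLISTS"] = "0"
    env["UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS"] = "1"
    return env


def lit_command(test_path, lit="./build/bin/llvm-lit"):
    """Build the llvm-lit invocation for a single e2e test."""
    return [lit, "-vv", test_path]


env = regular_cmdlist_env({})
cmd = lit_command("sycl/test-e2e/Graph/RecordReplay/usm_copy_in_order.cpp")
# To actually run it from the llvm checkout:
#   subprocess.run(cmd, env=regular_cmdlist_env(), check=True)
print(cmd, env)
```

Passing the environment explicitly makes it visible in logs which command-list mode a given result came from.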

Jul 26 '24 22:07 nrspruit

@pbalcer Seems like the e2e L0 tests are getting stuck. Could you please check that? I've also tried to run the e2e tests locally, and they all seem to be passing. This is running on gen12 and regular commandlist should be in use:
`$ bash ./test.sh
llvm-lit: /home/scss_dev/workspace/llvm/sycl/test-e2e/lit.cfg.py:769: note: Architectures for opencl:gpu: intel_gpu_adl_s
-- Testing: 11 tests, 11 workers --
PASS: SYCL :: Graph/RecordReplay/dotp_in_order.cpp (1 of 11)
PASS: SYCL :: Graph/RecordReplay/usm_copy_in_order.cpp (2 of 11)
PASS: SYCL :: Graph/RecordReplay/dotp_multiple_queues.cpp (3 of 11)
PASS: SYCL :: Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp (4 of 11)
PASS: SYCL :: Graph/RecordReplay/host_task_in_order.cpp (5 of 11)
PASS: SYCL :: Graph/RecordReplay/dotp_in_order_pause.cpp (6 of 11)
PASS: SYCL :: Graph/RecordReplay/sub_graph_in_order.cpp (7 of 11)
PASS: SYCL :: Graph/RecordReplay/barrier_multi_queue.cpp (8 of 11)
PASS: SYCL :: DiscardEvents/discard_events_mixed_calls.cpp (9 of 11)
PASS: SYCL :: ESIMD/BitonicSortKv2.cpp (10 of 11)
PASS: SYCL :: ESIMD/kmeans/kmeans.cpp (11 of 11)

Testing Time: 21.48s

Total Discovered Tests: 11 Passed: 11 (100.00%)`

Jul 30 '24 23:07 winstonzhang-intel

> @pbalcer Seems like the e2e L0 tests are getting stuck.

The system we used in CI died and we haven't managed to get it back up yet.

Thanks for checking that the e2e tests are now passing. I'm not sure what was wrong with the runs in the CI (maybe a stale commit?).

Jul 31 '24 06:07 pbalcer

Compute Benchmarks level_zero run (with params: ): https://github.com/oneapi-src/unified-runtime/actions/runs/10195419517

Aug 01 '24 09:08 github-actions[bot]

Compute Benchmarks level_zero run (): https://github.com/oneapi-src/unified-runtime/actions/runs/10195419517 Job status: failure. Test status: failure.

Aug 01 '24 09:08 github-actions[bot]

CudaSift benchmark has failed:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 87 3142 2.3622% 1 2

Performing data verification 
Data verification FAILED.

This is on 1T PVC.

You can run the same benchmark by using the scripts here:
$ ./main.py ~/benchmarks_workdir/ ~/llvm/build/ --filter CudaSift --iterations 1

Here, ~/benchmarks_workdir/ is the location where the benchmarks will be built, and ~/llvm/build/ is the location of the compiler that was built with the desired UR version. See $ ./main.py --help for more options.
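If you want to drive this from another script, a minimal sketch of assembling the same invocation follows. It uses only the positional arguments and the `--filter` and `--iterations` flags shown in the command above; `benchmark_command` is a hypothetical helper, and anything beyond those flags should be checked against `./main.py --help`.

```python
from pathlib import Path


def benchmark_command(workdir, compiler_build, name_filter, iterations=1):
    """Assemble the main.py invocation for one filtered benchmark run."""
    return [
        "./main.py",
        str(Path(workdir).expanduser()),         # where the benchmarks get built
        str(Path(compiler_build).expanduser()),  # compiler built with the desired UR
        "--filter", name_filter,
        "--iterations", str(iterations),
    ]


cmd = benchmark_command("~/benchmarks_workdir/", "~/llvm/build/", "CudaSift")
# To run it from the scripts/benchmarks directory:
#   subprocess.run(cmd, check=True)
print(" ".join(cmd))
```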

Aug 01 '24 09:08 pbalcer

Compute Benchmarks level_zero run (with params: ): https://github.com/oneapi-src/unified-runtime/actions/runs/10305771352

Aug 08 '24 16:08 github-actions[bot]

Compute Benchmarks level_zero run (): https://github.com/oneapi-src/unified-runtime/actions/runs/10305771352 Job status: failure. Test status: failure.

Aug 08 '24 16:08 github-actions[bot]

Compute Benchmarks level_zero run (with params: --env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/10880913609

Sep 16 '24 09:09 github-actions[bot]

Compute Benchmarks level_zero run (--env UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 --env UR_L0_USE_DRIVER_INORDER_LISTS=1): https://github.com/oneapi-src/unified-runtime/actions/runs/10880913609 Job status: success. Test status: success.

Summary

result is better

| Benchmark | This PR | baseline |
|---|---|---|
| api_overhead_benchmark_sycl SubmitKernel out of order | 48.362 | 50.631 |
| api_overhead_benchmark_sycl SubmitKernel in order | 47.024 | 49.385 |
| api_overhead_benchmark_ur SubmitKernel out of order | 31.312 | 31.93 |
| api_overhead_benchmark_ur SubmitKernel in order | 25.546 | 28.586 |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 424.685 | 423.457 |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 261.384 | 253.906 |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 10.089 | 9.179 |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.002 | 1.854 |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.143 | 4.506 |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 2.096 | 3.613 |
| miscellaneous_benchmark_sycl VectorSum | 858.416 | 863.651 |
| Velocity-Bench Hashtable | 207.852567 | 178.291413 |
| Velocity-Bench Bitcracker | 35.6076 | 35.8407 |
| Velocity-Bench CudaSift | 256.843 | 283.294 |
| Velocity-Bench Easywave | 446 | 457.0 |
| Velocity-Bench QuickSilver | 90.08 | 115.63 |
| Velocity-Bench Sobel Filter | 985.857 | 934.963 |
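To put the summary numbers in relative terms, the following sketch computes the change of "This PR" against the baseline for a few rows copied from the summary. `relative_change` is a hypothetical helper. Note that the sign alone does not say whether a change is good: some rows are times (lower is better) and others are throughput figures such as M keys/sec (higher is better), and the summary itself does not record which is which.

```python
def relative_change(this_pr: float, baseline: float) -> float:
    """Fractional change of this_pr relative to baseline (0.05 means +5%)."""
    return (this_pr - baseline) / baseline


# (This PR, baseline) pairs taken from the summary.
rows = {
    "api_overhead_benchmark_ur SubmitKernel in order": (25.546, 28.586),
    "Velocity-Bench Hashtable": (207.852567, 178.291413),
    "Velocity-Bench QuickSilver": (90.08, 115.63),
}

for name, (pr, base) in rows.items():
    print(f"{name}: {relative_change(pr, base):+.1%}")
```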

Charts

api_overhead_benchmark_sycl SubmitKernel out of order

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl SubmitKernel out of order
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=0<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)

        This PR (48.362 μs)   : crit, 0, 48

        baseline (50.631 μs)   :  0, 50

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_sycl SubmitKernel in order

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl SubmitKernel in order
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=1<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)

        This PR (47.024 μs)   : crit, 0, 47

        baseline (49.385 μs)   :  0, 49

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_ur SubmitKernel out of order

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_ur SubmitKernel out of order
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=ur<br>Profiling=0<br>Ioq=0<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)

        This PR (31.312 μs)   : crit, 0, 31

        baseline (31.93 μs)   :  0, 31

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_ur SubmitKernel in order

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_ur SubmitKernel in order
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=ur<br>Profiling=0<br>Ioq=1<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)

        This PR (25.546 μs)   : crit, 0, 25

        baseline (28.586 μs)   :  0, 28

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QueueInOrderMemcpy(api=sycl<br>IsCopyOnly=0<br>sourcePlacement=Device<br>destinationPlacement=Device<br>size=1KB<br>count=100)

        This PR (424.685 μs)   : crit, 0, 424

        baseline (423.457 μs)   :  0, 423

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QueueInOrderMemcpy(api=sycl<br>IsCopyOnly=0<br>sourcePlacement=Host<br>destinationPlacement=Device<br>size=1KB<br>count=100)

        This PR (261.384 μs)   : crit, 0, 261

        baseline (253.906 μs)   :  0, 253

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QueueMemcpy(api=sycl<br>sourcePlacement=Device<br>destinationPlacement=Device<br>size=1KB)

        This PR (10.089 μs)   : crit, 0, 10

        baseline (9.179 μs)   :  0, 9

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240
    todayMarker off
    dateFormat  X
    axisFormat %s

    section StreamMemory(api=sycl<br>type=Triad<br>size=10KB<br>useEvents=0<br>contents=Zeros<br>memoryPlacement=Device)

        This PR (3.002 GB/s)   : crit, 0, 3

        baseline (1.854 GB/s)   :  0, 1

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section ExecImmediateCopyQueue(api=sycl<br>IsCopyOnly=1<br>MeasureCompletionTime=0<br>src=Device<br>dst=Device<br>size=1KB<br>ioq=0)

        This PR (2.143 μs)   : crit, 0, 2

        baseline (4.506 μs)   :  0, 4

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section ExecImmediateCopyQueue(api=sycl<br>IsCopyOnly=1<br>MeasureCompletionTime=0<br>src=Host<br>dst=Host<br>size=1KB<br>ioq=1)

        This PR (2.096 μs)   : crit, 0, 2

        baseline (3.613 μs)   :  0, 3

    -   : 0, 0

    -   : 0, 0

miscellaneous_benchmark_sycl VectorSum

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title miscellaneous_benchmark_sycl VectorSum
    todayMarker off
    dateFormat  X
    axisFormat %s

    section VectorSum(api=sycl<br>numberOfElementsX=512<br>numberOfElementsY=256<br>numberOfElementsZ=256)

        This PR (858.416 μs)   : crit, 0, 858

        baseline (863.651 μs)   :  0, 863

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Hashtable

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Hashtable
    todayMarker off
    dateFormat  X
    axisFormat %s

    section hashtable

        This PR (207.852567 M keys/sec)   : crit, 0, 207

        baseline (178.291413 M keys/sec)   :  0, 178

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Bitcracker

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Bitcracker
    todayMarker off
    dateFormat  X
    axisFormat %s

    section bitcracker

        This PR (35.6076 s)   : crit, 0, 35

        baseline (35.8407 s)   :  0, 35

    -   : 0, 0

    -   : 0, 0

Velocity-Bench CudaSift

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench CudaSift
    todayMarker off
    dateFormat  X
    axisFormat %s

    section cudaSift

        This PR (256.843 ms)   : crit, 0, 256

        baseline (283.294 ms)   :  0, 283

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Easywave

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Easywave
    todayMarker off
    dateFormat  X
    axisFormat %s

    section easywave

        This PR (446 ms)   : crit, 0, 446

        baseline (457.0 ms)   :  0, 457

    -   : 0, 0

    -   : 0, 0

Velocity-Bench QuickSilver

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench QuickSilver
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QuickSilver

        This PR (90.08 MMS/CTT)   : crit, 0, 90

        baseline (115.63 MMS/CTT)   :  0, 115

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Sobel Filter

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Sobel Filter
    todayMarker off
    dateFormat  X
    axisFormat %s

    section sobel_filter

        This PR (985.857 ms)   : crit, 0, 985

        baseline (934.963 ms)   :  0, 934

    -   : 0, 0

    -   : 0, 0

Details

SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),48.362,47.646,7.34%,43.188,547.322,[CPU],[us]

SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),47.024,46.508,6.65%,44.278,209.617,[CPU],[us]

SubmitKernel(api=ur Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=ur Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),31.312,31.050,6.53%,29.597,503.558,[CPU],[us]

SubmitKernel(api=ur Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=ur Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),25.546,29.884,27.77%,13.324,230.644,[CPU],[us]

QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100),424.685,467.871,19.83%,246.890,870.042,[CPU],[us]

QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100),261.384,238.517,22.09%,230.359,746.004,[CPU],[us]

QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB),10.089,9.944,18.73%,7.751,150.687,[CPU],[us]

StreamMemory(api=sycl type=Triad size=10KB useEvents=0 contents=Zeros memoryPlacement=Device)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
StreamMemory(api=sycl type=Triad size=10KB useEvents=0 contents=Zeros memoryPlacement=Device),3.002,3.081,6.77%,0.382,3.365,[CPU],[GB/s]

ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0),2.143,2.101,14.10%,1.894,75.835,[CPU],[us]

ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1),2.096,1.670,45.10%,1.554,28.530,[CPU],[us]

VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256)

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256),858.416,858.902,0.49%,821.607,879.002,[GPU],bw [GB/s]

hashtable

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/hashtable/hashtable_sycl --no-verify

Output:

hashtable - total time for whole calculation: 0.645735 s
207.852567 million keys/second

bitcracker

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/bitcracker/bitcracker -f /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

Attack
================================================

Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range: npknpByH7N2m3OnLNH1X9DJxLrzIFWk ..... dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.0101897 s
bitcracker - total time for whole calculation: 35.6076 s
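The hashtable and bitcracker runs above both report their result on a line of the form `<name> - total time for whole calculation: <seconds> s`. A minimal sketch for extracting that value is shown here; `parse_total_time` is a hypothetical helper rather than anything from the benchmark scripts, and the pattern is inferred only from the two lines in this log.

```python
import re
from typing import Tuple

# Pattern inferred from the two "total time" lines in this log (an assumption).
TOTAL_TIME_RE = re.compile(
    r"(\w+) - total time for whole calculation: ([0-9.]+) s"
)


def parse_total_time(output: str) -> Tuple[str, float]:
    """Return (benchmark name, total seconds) from raw benchmark output."""
    match = TOTAL_TIME_RE.search(output)
    if match is None:
        raise ValueError("no 'total time for whole calculation' line found")
    return match.group(1), float(match.group(2))


sample = (
    "time to subtract from total: 0.0101897 s\n"
    "bitcracker - total time for whole calculation: 35.6076 s\n"
)
name, seconds = parse_total_time(sample)
print(name, seconds)
```

Searching the whole output rather than matching line by line keeps the helper working whether the result shares a line with other text or sits on its own.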

cudaSift

Environment Variables:

UR_L0_USE_DRIVER_COUNTER_BASED_EVENTS=1 UR_L0_USE_DRIVER_INORDER_LISTS=1

Command:

/home/test-user/bench_workdir/cudaSift/cudaSift

Output:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1185 1247 32.1749% 1 2