Totsu
F32CUDA seems too slow
Benchmark, profile, and optimize it to make it faster.
https://github.com/convexbrain/Totsu/releases/tag/totsu_f32cuda_v0.1.0
A benchmark result of LP:
- https://github.com/convexbrain/Totsu/tree/1f5200599ffd8bdf15e6ce672bcc1c2f0bbc11bb/experimental/benchmark_lp
- F32CUDA is faster than FloatGeneric (a timing-harness sketch follows the test environment below).

Test environment:
- CPU
  - Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
  - RAM: 32.0 GB
- GPU
  - NVIDIA GeForce RTX 3070
  - CUDA cores: 5888
  - Core clock: 1725 MHz
  - Memory bandwidth: 448.06 GB/s
  - Memory: 8192 MB GDDR6
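
A backend comparison like this comes down to timing the same solve with FloatGeneric and with F32CUDA. Below is a minimal, hypothetical Criterion-style harness showing the shape of such a measurement; `solve_lp_float_generic` and `solve_lp_f32cuda` are placeholder wrappers, not functions from the linked benchmark code.

```rust
// Cargo.toml (dev-dependencies): criterion = "0.5"
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder wrappers around the two Totsu linear-algebra backends; the real
// measurement code lives in the linked benchmark_lp / benchmark_qp directories.
fn solve_lp_float_generic(_n: usize) {
    // ... build a random LP with n variables and solve it with FloatGeneric
}

fn solve_lp_f32cuda(_n: usize) {
    // ... solve the same LP with the F32CUDA backend
}

fn bench_backends(c: &mut Criterion) {
    for &n in &[100usize, 200, 400, 800] {
        c.bench_function(&format!("FloatGeneric, n = {n}"), |b| {
            b.iter(|| solve_lp_float_generic(n))
        });
        c.bench_function(&format!("F32CUDA, n = {n}"), |b| {
            b.iter(|| solve_lp_f32cuda(n))
        });
    }
}

criterion_group!(benches, bench_backends);
criterion_main!(benches);
```
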
A benchmark result of QP:
- https://github.com/convexbrain/Totsu/tree/884e36b4fd32d696ddca046af755ad8a2d120a61/experimental/benchmark_qp
- F32CUDA is slower than FloatGeneric. 😭

Proceed to profiling using this benchmark.
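
Before reaching for a GPU profiler, a coarse per-phase breakdown can already narrow things down. This is only a sketch of such hand instrumentation; `matrix_ops` and `cone_projection` are placeholder names for whatever the solver actually does in each iteration.

```rust
use std::time::{Duration, Instant};

// Placeholders for the real per-iteration phases of the solver.
fn matrix_ops() {}
fn cone_projection() {}

fn main() {
    let mut t_matrix = Duration::ZERO;
    let mut t_proj = Duration::ZERO;

    for _ in 0..10_000 {
        let t0 = Instant::now();
        matrix_ops(); // e.g. the matrix-vector products of one iteration
        t_matrix += t0.elapsed();

        let t0 = Instant::now();
        cone_projection(); // projection of the iterate onto the cone
        t_proj += t0.elapsed();
    }

    println!("matrix ops:      {t_matrix:?}");
    println!("cone projection: {t_proj:?}");
}
```
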
A profiling result of the QP benchmark:
- Many memory accesses occur when projecting onto the cone.
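
One plausible reading of this (an assumption, not something stated in the repo): the solver projects its iterate onto the cone every iteration, and if that projection runs element by element on the host while the rest of the linear algebra stays on the GPU, the whole vector travels device -> host -> device each time. The projections themselves are arithmetically trivial, as this plain-Rust sketch of the nonnegative-orthant and second-order-cone cases shows; which cones the QP benchmark actually uses is also an assumption here.

```rust
/// Projection onto the nonnegative orthant: x_i <- max(x_i, 0).
fn proj_nonneg_orthant(x: &mut [f32]) {
    for v in x.iter_mut() {
        *v = v.max(0.0);
    }
}

/// Projection onto the second-order cone { (t, z) : ||z||_2 <= t }.
fn proj_second_order_cone(x: &mut [f32]) {
    let (t, z) = match x.split_first_mut() {
        Some(tz) => tz,
        None => return,
    };
    let norm = z.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm <= -*t {
        // Polar-cone case: the projection is the origin.
        z.iter_mut().for_each(|v| *v = 0.0);
        *t = 0.0;
    } else if norm > *t {
        // Outside the cone: scale onto the boundary.
        let a = 0.5 * (1.0 + *t / norm);
        z.iter_mut().for_each(|v| *v *= a);
        *t = a * norm;
    }
    // Otherwise the point is already inside the cone and stays unchanged.
}

fn main() {
    let mut x = [-1.0_f32, 2.0, -3.0];
    proj_nonneg_orthant(&mut x);
    println!("orthant projection: {x:?}");

    let mut y = [1.0_f32, 3.0, 4.0]; // t = 1, ||z|| = 5 > t, so it gets scaled to the boundary
    proj_second_order_cone(&mut y);
    println!("SOC projection:     {y:?}");
}
```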

https://github.com/convexbrain/Totsu/tree/b56407463b691a3f2418510bc43e8a72d5186fc1/experimental/benchmark_qp
- CUDA-izing projection onto cones as much as possible (see the sketch after this list).
- 200 vars (100 primals, 100 duals).

- 400 vars (200 primals, 200 duals).
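
One way to read "CUDA-izing projection onto cones as much as possible", sketched below under assumptions: if the projection is expressed through elementwise and reduction primitives that the device backend provides, only scalars ever cross the bus. The `DeviceVec` trait and its methods are hypothetical, not Totsu's actual F32CUDA API; the `Vec<f32>` impl is just a host mock so the snippet compiles and runs.

```rust
// Hypothetical device-vector interface (NOT Totsu's actual API). Each projection
// step maps onto one device primitive, so the iterate never round-trips through
// host memory; at most a scalar does.
trait DeviceVec {
    /// x_i <- max(x_i, 0), elementwise, on the device.
    fn clamp_min_zero(&mut self);
    /// Euclidean norm via a device reduction; only the scalar result reaches the host.
    fn norm2(&self) -> f32;
    /// x <- a * x on the device.
    fn scale(&mut self, a: f32);
}

/// Nonnegative-orthant projection: a single elementwise kernel launch.
fn proj_nonneg_orthant_dev<V: DeviceVec>(x: &mut V) {
    x.clamp_min_zero();
}

/// Second-order-cone projection of (t, z): the bulk vector z is scaled in place
/// on the device; only t and ||z|| cross the bus.
fn proj_soc_dev<V: DeviceVec>(t: &mut f32, z: &mut V) {
    let norm = z.norm2();
    if norm <= -*t {
        z.scale(0.0);
        *t = 0.0;
    } else if norm > *t {
        let a = 0.5 * (1.0 + *t / norm);
        z.scale(a);
        *t = a * norm;
    }
}

// Host mock, only so the sketch compiles and runs; a real backend would implement
// these methods with CUDA kernels / cuBLAS-style calls.
impl DeviceVec for Vec<f32> {
    fn clamp_min_zero(&mut self) {
        self.iter_mut().for_each(|v| *v = v.max(0.0));
    }
    fn norm2(&self) -> f32 {
        self.iter().map(|v| v * v).sum::<f32>().sqrt()
    }
    fn scale(&mut self, a: f32) {
        self.iter_mut().for_each(|v| *v *= a);
    }
}

fn main() {
    let mut t = 1.0_f32;
    let mut z = vec![3.0_f32, 4.0];
    proj_soc_dev(&mut t, &mut z);
    println!("t = {t}, z = {z:?}"); // lands on the cone boundary: t = 3, z = [1.8, 2.4]
}
```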

https://github.com/convexbrain/Totsu/tree/77f0e5cc10e7a2d29567352f88135a99ed620be1/experimental/benchmark_qp
- FxHashMap instead of HashMap (see the sketch after this list).
- 200 vars (100 primals, 100 duals).
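
`std::collections::HashMap` defaults to a SipHash-based hasher, which is DoS-resistant but comparatively slow for many small lookups; `FxHashMap` from the `rustc-hash` crate is a drop-in replacement with a much faster non-cryptographic hasher. A minimal sketch of the swap; the `usize -> f32` contents are made up and need not match what the solver actually stores in its maps.

```rust
// Cargo.toml: rustc-hash = "1"
use rustc_hash::FxHashMap;

fn main() {
    // Drop-in replacement: the type alias changes, the HashMap API does not.
    let mut coeffs: FxHashMap<usize, f32> = FxHashMap::default();
    coeffs.insert(0, 1.5);
    coeffs.insert(42, -0.25);

    if let Some(v) = coeffs.get(&42) {
        println!("coeffs[42] = {v}");
    }
}
```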

https://github.com/convexbrain/Totsu/tree/13b8d378f79445c53b9c9f77fbf4389029423d12/experimental/benchmark_qp
- Intermittent criteria checks (see the sketch after this list).
- 200 vars (100 primals, 100 duals).
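
Evaluating the stopping criteria presumably involves residual norms, which with a GPU backend means device reductions plus a host read-back on every iteration; running the check only every N iterations amortizes that cost, at the price of overshooting the true stopping point by at most N-1 iterations. A minimal sketch of the pattern; `one_iteration`, `converged`, and the constants are placeholders, not the solver's real code.

```rust
// Run the (relatively expensive) termination test only every CHECK_PERIOD iterations.
const CHECK_PERIOD: usize = 16;
const MAX_ITER: usize = 100_000;

fn solve_sketch<S>(
    state: &mut S,
    mut one_iteration: impl FnMut(&mut S),
    mut converged: impl FnMut(&S) -> bool,
) -> usize {
    for i in 0..MAX_ITER {
        one_iteration(state);

        // The check is assumed to need residual norms (device reductions plus a
        // host read-back with a GPU backend), so it only runs intermittently.
        if (i + 1) % CHECK_PERIOD == 0 && converged(state) {
            return i + 1; // may overshoot the true stopping point by < CHECK_PERIOD
        }
    }
    MAX_ITER
}

fn main() {
    // Toy usage: iterate x <- 0.5 * x until |x| < 1e-6.
    let mut x = 1.0_f32;
    let iters = solve_sketch(&mut x, |x| *x *= 0.5, |x| x.abs() < 1e-6);
    println!("stopped after {iters} iterations, x = {x}");
}
```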


Observations so far:
- The benefit of CUDA only shows up from about 800 variables upward.
- The number of iterations does not grow monotonically with problem size, probably because the QPs are generated from random data.
- To begin with, the number of iterations is simply too large.