[VID] batch commit with GPU is unexpectedly slow
To reproduce, check out this branch: https://github.com/EspressoSystems/jellyfish/tree/cl/gpu-profiling
Running cargo test --features gpu-vid,kzg-print-trace,print-trace -p jf-primitives -- profile_gpu_commit --nocapture gives the result below. The total number of elements stays at 1048576, yet the batch commit time grows with the batch size: from 61.846ms at batch size 1 to 311.326ms at batch size 4096, though not monotonically (batch size 1024 takes 167.020ms).
However, according to cargo bench --bench kzg-gpu --features "test-srs icicle", the MSM should only cost about 28 ms [28.107 ms 28.438 ms 28.988 ms].
Start: KZG10::Setup with prover degree 1048576 and verifier degree 1
··Start: Generating powers of G
··End: Generating powers of G ..................................................8.384s
End: KZG10::Setup with prover degree 1048576 and verifier degree 1 .............8.769s
Start: Type Conversion: ark->ICICLE: Group
End: Type Conversion: ark->ICICLE: Group .......................................9.590ms
Start: Load group elements: CPU->GPU
End: Load group elements: CPU->GPU .............................................7.521ms
Start: Batch commit 1048576 total elements, batch size 1
··Start: Type Conversion: ark->ICICLE: Scalar
··End: Type Conversion: ark->ICICLE: Scalar ....................................23.156ms
··Start: Load scalars: CPU->GPU
··End: Load scalars: CPU->GPU ..................................................2.502ms
··Start: GPU-accelerated MSM
··End: GPU-accelerated MSM .....................................................22.853ms
··Start: Sync MSM result
··End: Sync MSM result .........................................................11.730ms
··Start: Load MSM result GPU->CPU
··End: Load MSM result GPU->CPU ................................................52.750µs
··Start: Type Conversion: ICICLE->ark: Group
··End: Type Conversion: ICICLE->ark: Group .....................................182.968µs
End: Batch commit 1048576 total elements, batch size 1 .........................61.846ms
Start: Batch commit 1048576 total elements, batch size 8
··Start: Type Conversion: ark->ICICLE: Scalar
··End: Type Conversion: ark->ICICLE: Scalar ....................................24.932ms
··Start: Load scalars: CPU->GPU
··End: Load scalars: CPU->GPU ..................................................2.627ms
··Start: GPU-accelerated MSM
··End: GPU-accelerated MSM .....................................................22.863ms
··Start: Sync MSM result
··End: Sync MSM result .........................................................27.982ms
··Start: Load MSM result GPU->CPU
··End: Load MSM result GPU->CPU ................................................49.570µs
··Start: Type Conversion: ICICLE->ark: Group
··End: Type Conversion: ICICLE->ark: Group .....................................682.948µs
End: Batch commit 1048576 total elements, batch size 8 .........................80.681ms
Start: Batch commit 1048576 total elements, batch size 16
··Start: Type Conversion: ark->ICICLE: Scalar
··End: Type Conversion: ark->ICICLE: Scalar ....................................22.681ms
··Start: Load scalars: CPU->GPU
··End: Load scalars: CPU->GPU ..................................................5.194ms
··Start: GPU-accelerated MSM
··End: GPU-accelerated MSM .....................................................98.494ms
··Start: Sync MSM result
··End: Sync MSM result .........................................................49.478ms
··Start: Load MSM result GPU->CPU
··End: Load MSM result GPU->CPU ................................................109.749µs
··Start: Type Conversion: ICICLE->ark: Group
··End: Type Conversion: ICICLE->ark: Group .....................................865.120µs
End: Batch commit 1048576 total elements, batch size 16 ........................178.481ms
Start: Batch commit 1048576 total elements, batch size 256
··Start: Type Conversion: ark->ICICLE: Scalar
··End: Type Conversion: ark->ICICLE: Scalar ....................................23.140ms
··Start: Load scalars: CPU->GPU
··End: Load scalars: CPU->GPU ..................................................10.269ms
··Start: GPU-accelerated MSM
··End: GPU-accelerated MSM .....................................................180.192ms
··Start: Sync MSM result
··End: Sync MSM result .........................................................61.028ms
··Start: Load MSM result GPU->CPU
··End: Load MSM result GPU->CPU ................................................260.128µs
··Start: Type Conversion: ICICLE->ark: Group
··End: Type Conversion: ICICLE->ark: Group .....................................3.137ms
End: Batch commit 1048576 total elements, batch size 256 .......................279.902ms
Start: Batch commit 1048576 total elements, batch size 1024
··Start: Type Conversion: ark->ICICLE: Scalar
··End: Type Conversion: ark->ICICLE: Scalar ....................................24.463ms
··Start: Load scalars: CPU->GPU
··End: Load scalars: CPU->GPU ..................................................2.960ms
··Start: GPU-accelerated MSM
··End: GPU-accelerated MSM .....................................................60.377ms
··Start: Sync MSM result
··End: Sync MSM result .........................................................64.456ms
··Start: Load MSM result GPU->CPU
··End: Load MSM result GPU->CPU ................................................159.259µs
··Start: Type Conversion: ICICLE->ark: Group
··End: Type Conversion: ICICLE->ark: Group .....................................12.733ms
End: Batch commit 1048576 total elements, batch size 1024 ......................167.020ms
Start: Batch commit 1048576 total elements, batch size 4096
··Start: Type Conversion: ark->ICICLE: Scalar
··End: Type Conversion: ark->ICICLE: Scalar ....................................23.588ms
··Start: Load scalars: CPU->GPU
··End: Load scalars: CPU->GPU ..................................................5.343ms
··Start: GPU-accelerated MSM
··End: GPU-accelerated MSM .....................................................198.382ms
··Start: Sync MSM result
··End: Sync MSM result .........................................................39.863ms
··Start: Load MSM result GPU->CPU
··End: Load MSM result GPU->CPU ................................................200.608µs
··Start: Type Conversion: ICICLE->ark: Group
··End: Type Conversion: ICICLE->ark: Group .....................................41.532ms
End: Batch commit 1048576 total elements, batch size 4096 ......................311.326ms
test pcs::univariate_kzg::tests::icicle::profile_gpu_commit ... ok
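To make the trend easier to read, the sketch below tabulates the GPU MSM time, the sync time, and the total from the trace above, and computes each total relative to batch size 1. The figures are copied from the trace; the names TRACE and slowdown are made up for this illustration and are not part of jellyfish or ICICLE.

```rust
/// One row per run in the trace: (batch size, GPU MSM ms, sync ms, total ms).
/// Values are copied from the trace; TRACE is a name used only in this sketch.
const TRACE: [(u32, f64, f64, f64); 6] = [
    (1, 22.853, 11.730, 61.846),
    (8, 22.863, 27.982, 80.681),
    (16, 98.494, 49.478, 178.481),
    (256, 180.192, 61.028, 279.902),
    (1024, 60.377, 64.456, 167.020),
    (4096, 198.382, 39.863, 311.326),
];

/// Ratio of a total commit time to the batch-size-1 total.
fn slowdown(total_ms: f64) -> f64 {
    total_ms / TRACE[0].3
}

fn main() {
    println!(
        "{:>6} {:>12} {:>12} {:>8} {:>9}",
        "batch", "msm+sync", "total", "share", "slowdown"
    );
    for &(batch, msm_ms, sync_ms, total_ms) in TRACE.iter() {
        // MSM and sync are the two phases that grow most with the batch size.
        let gpu_ms = msm_ms + sync_ms;
        println!(
            "{:>6} {:>10.3}ms {:>10.3}ms {:>7.1}% {:>8.2}x",
            batch,
            gpu_ms,
            total_ms,
            100.0 * gpu_ms / total_ms,
            slowdown(total_ms)
        );
    }
}
```

At batch size 4096 the total is roughly 5x the batch-size-1 total, and most of the growth sits in the MSM and sync phases rather than in the scalar conversion, which stays near 23ms throughout.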
Another odd observation: if you do not warm up a new CUDA stream for each run, performance is also poor, even with a batch size of 1. cc @alxiong
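For reference, the general warm-up-then-measure pattern looks roughly like the sketch below. It is not the profiling code from the branch and it does not touch CUDA or ICICLE: time_after_warmup is a hypothetical helper, and the closure is a CPU stand-in for the GPU batch commit.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `op` `warmup` times untimed, then times one more run.
fn time_after_warmup<F: FnMut()>(warmup: usize, mut op: F) -> Duration {
    for _ in 0..warmup {
        op(); // untimed: absorbs one-off setup costs
    }
    let start = Instant::now();
    op(); // the only run that is measured
    start.elapsed()
}

fn main() {
    let mut calls = 0u32;
    // Stand-in workload; in the real test this would be the GPU batch commit.
    let elapsed = time_after_warmup(2, || {
        calls += 1;
        black_box((0..1_000_000u64).sum::<u64>());
    });
    println!("timed run took {:?} ({} calls in total)", elapsed, calls);
}
```

Comparing the timed run with warmup set to 0 against a run with one or more warm-up calls separates one-off setup cost from steady-state cost.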
Also, with a single batch, type conversion takes more than one-third of the time: the ark->ICICLE scalar conversion alone is 23.156ms of the 61.846ms total. We could eliminate that (see #516).
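The share of the two type conversions in the batch-size-1 run can be checked with a small calculation. The phase times are copied from the trace; PHASES, TOTAL_MS, share_percent and conversion_ms are names invented for this sketch.

```rust
/// Phases of the batch-size-1 commit, in milliseconds, copied from the trace.
const PHASES: [(&str, f64); 6] = [
    ("Type Conversion: ark->ICICLE: Scalar", 23.156),
    ("Load scalars: CPU->GPU", 2.502),
    ("GPU-accelerated MSM", 22.853),
    ("Sync MSM result", 11.730),
    ("Load MSM result GPU->CPU", 0.052750),
    ("Type Conversion: ICICLE->ark: Group", 0.182968),
];

/// Total reported for "Batch commit 1048576 total elements, batch size 1".
const TOTAL_MS: f64 = 61.846;

/// Percentage of the batch-size-1 total taken by `ms`.
fn share_percent(ms: f64) -> f64 {
    100.0 * ms / TOTAL_MS
}

/// Combined time of both type-conversion phases.
fn conversion_ms() -> f64 {
    PHASES
        .iter()
        .filter(|(name, _)| name.starts_with("Type Conversion"))
        .map(|(_, ms)| *ms)
        .sum()
}

fn main() {
    for &(name, ms) in PHASES.iter() {
        println!("{:<40} {:>8.3}ms {:>5.1}%", name, ms, share_percent(ms));
    }
    let conv = conversion_ms();
    println!("type conversion in total: {:.3}ms ({:.1}%)", conv, share_percent(conv));
}
```

This counts only the conversions inside the batch commit; the one-off ark->ICICLE group conversion (9.590ms) happens before the commit and is not part of the 61.846ms total.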