nnue-pytorch
Use two streams, one per FT slice.
Since the computations of the two slices of the feature transformer output are independent, we can try to run them on separate streams. On my GTX 750 I notice a slight performance increase with very small FT sizes, and the CUDA profiler shows some overlap between kernels.

This may increase performance on beefier GPUs like V100, but that remains to be tested.
Note that we can in fact run two separate streams for the backward pass too, even though both operate on the same output buffer, because all writes are atomic.
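For reference, the two-stream forward idea can be sketched roughly like this in PyTorch. The module and tensor names here are hypothetical (this is not the actual nnue-pytorch kernel code), and the sketch falls back to sequential execution when CUDA is unavailable:

```python
import torch

def ft_forward_two_streams(ft_white, ft_black, x_white, x_black):
    """Run the two independent FT slices on separate CUDA streams so
    their kernels can overlap; run sequentially when not on a GPU."""
    if x_white.is_cuda:
        s1 = torch.cuda.Stream()
        s2 = torch.cuda.Stream()
        # Make both side streams wait for any pending work
        # that was queued on the default stream.
        s1.wait_stream(torch.cuda.current_stream())
        s2.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s1):
            out_white = ft_white(x_white)
        with torch.cuda.stream(s2):
            out_black = ft_black(x_black)
        # Re-join before the combined result is consumed.
        torch.cuda.current_stream().wait_stream(s1)
        torch.cuda.current_stream().wait_stream(s2)
    else:
        out_white = ft_white(x_white)
        out_black = ft_black(x_black)
    return torch.cat([out_white, out_black], dim=1)
```

The `wait_stream` calls are the important part: without them, the side streams could start before the inputs are ready, or the default stream could read the outputs before the slice kernels finish.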
That looks good to me; exposing the parallelism can only help.
On V100, 1 thread, 1 worker:
before: 48.29 at 1000, 47.94 at 2000, 47.84 at 3000
after: 48.29 at 1000, 47.93 at 2000, 47.82 at 3000
8 threads, 4 workers:
before: 57.04 at 1000, 57.04 at 2000, 57.16 at 3000
after: 56.37 at 1000, 56.41 at 2000, 56.52 at 3000
So it doesn't help, at least for now, but it also doesn't hurt.
We kind of know that on V100 and above training is limited by the CPU.
I'd like to see some benchmarks from other people before pushing this.