nnue-pytorch
Use two streams, one per FT slice.
Since the computations of the two slices of the feature transformer output are independent, we can try to run them on separate streams. On my GTX 750 I notice a slight performance increase with very small FT sizes, and the CUDA profiler shows some overlap between kernels.

This may increase performance on beefier GPUs like V100, but that remains to be tested.
Note that we can in fact run two separate streams for the backward pass too, even though both operate on the same output buffer, because all writes are atomic.
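For reference, the two-stream forward idea can be sketched roughly like this in PyTorch. The module and tensor names here are hypothetical (this is not the actual nnue-pytorch kernel code), and the sketch falls back to sequential execution when CUDA is unavailable:

```python
import torch

def ft_forward_two_streams(ft_white, ft_black, x_white, x_black):
    """Run the two independent FT slices on separate CUDA streams so
    their kernels can overlap; run sequentially when not on a GPU."""
    if x_white.is_cuda:
        s1 = torch.cuda.Stream()
        s2 = torch.cuda.Stream()
        # Make both side streams wait for any pending work
        # that was queued on the default stream.
        s1.wait_stream(torch.cuda.current_stream())
        s2.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s1):
            out_white = ft_white(x_white)
        with torch.cuda.stream(s2):
            out_black = ft_black(x_black)
        # Re-join before the combined result is consumed.
        torch.cuda.current_stream().wait_stream(s1)
        torch.cuda.current_stream().wait_stream(s2)
    else:
        out_white = ft_white(x_white)
        out_black = ft_black(x_black)
    return torch.cat([out_white, out_black], dim=1)
```

The `wait_stream` calls are the important part: without them, the side streams could start before the inputs are ready, or the default stream could read the outputs before the slice kernels finish.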
That looks good to me; exposing the parallelism can only help.
On V100, 1 thread, 1 worker:
before: 48.29 at 1000, 47.94 at 2000, 47.84 at 3000
after: 48.29 at 1000, 47.93 at 2000, 47.82 at 3000
8 threads, 4 workers:
before: 57.04 at 1000, 57.04 at 2000, 57.16 at 3000
after: 56.37 at 1000, 56.41 at 2000, 56.52 at 3000
So it doesn't help, at least for now, but it also doesn't hurt.
We kind of know that on V100 and above training is limited by the CPU.
I'd like to see some benchmarks from other people before pushing this.