Parallelize Meg CUDA Kernel build system

Open stas00 opened this issue 2 years ago • 0 comments

Building the Meg CUDA kernels takes forever (about 5 minutes) because the build runs sequentially and doesn't take advantage of multiple cores. The kernels are also rebuilt every time one changes the number of GPUs, which is very unproductive and makes the CI really slow.

Need to rewrite the build to parallelize it.

Side note: apex and deepspeed have this problem too, but deepspeed supports make -j.

Ideally the solution should come from pytorch. If we solve it generically, we could perhaps upstream the solution to pytorch core.
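To make the idea concrete, here is a minimal sketch of what a generic parallel build driver could look like. It is not the existing Meg build code: `build_all`, `build_one`, and the extension names are all hypothetical, and it assumes the kernels are independent of each other so they can be compiled at the same time. Threads are enough here because the real work would happen in external compiler processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_all(extensions, build_one, max_workers=None):
    """Build independent extensions concurrently.

    extensions:  dict mapping an extension name to its list of source files
    build_one:   hypothetical callable (name, sources) -> build artifact;
                 in practice it would invoke the compiler for one extension
    max_workers: cap on concurrent builds; defaults to the CPU count
    """
    if max_workers is None:
        # Never ask for more workers than there are extensions or cores,
        # and never fewer than one.
        max_workers = min(len(extensions), os.cpu_count() or 1) or 1

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(build_one, name, sources): name
            for name, sources in extensions.items()
        }
        for future in as_completed(futures):
            # result() re-raises any exception from build_one,
            # so a failed build is not silently ignored.
            results[futures[future]] = future.result()
    return results
```

Usage would look something like `build_all({"kernel_a": ["a.cpp", "a.cu"], "kernel_b": ["b.cpp", "b.cu"]}, build_one)`, where `build_one` wraps whatever currently compiles a single kernel. This only parallelizes across extensions; parallelizing the compilation of files within one extension would have to come from the underlying build tool.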

stas00, Oct 29 '21 15:10