Megatron-DeepSpeed
Parallelize Meg CUDA Kernel build system
Building the Megatron CUDA kernels takes far too long, about 5 minutes, because the build runs sequentially and doesn't take advantage of multiple cores. The kernels are also rebuilt every time the number of GPUs changes, which is unproductive and makes the CI really slow.
Need to rewrite the build to parallelize it.
Side notes: apex and DeepSpeed have the same problem, but DeepSpeed supports make -j.
Ideally the solution should come from PyTorch; if we solve this generically, we could perhaps upstream it to PyTorch core.
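To make the idea concrete, here is a minimal sketch of building independent kernels concurrently instead of one after another. It is not the current Megatron build code: `build_all` and `build_one` are hypothetical names, and `build_one` stands in for whatever compiles a single kernel (for example a wrapper around `torch.utils.cpp_extension.load`). Whether that call is safe to run from several threads at once has not been verified here. Threads are assumed to be enough because the actual compilation would run in external compiler processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def build_all(kernels, build_one, max_workers=None):
    """Build every kernel concurrently and return {name: build result}.

    kernels:   dict mapping a kernel name to its list of source files.
    build_one: hypothetical callable(name, sources) that compiles one kernel.
    """
    if max_workers is None:
        # At most one worker per kernel, capped by the core count, never below 1.
        max_workers = max(1, min(len(kernels), os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all builds up front so they can overlap.
        futures = {
            name: pool.submit(build_one, name, sources)
            for name, sources in kernels.items()
        }
        # result() re-raises any build error in the caller.
        return {name: future.result() for name, future in futures.items()}
```

This only overlaps separate kernels; speeding up the compilation of a single kernel's sources would still depend on the underlying build tool.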