ghostplant
Have you tried: `export FAST_CUMSUM=0`?
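If it is easier to set the flag from Python than from the shell, here is a minimal sketch, assuming the variable only needs to be visible before Tutel's kernels are loaded:

```py
import os

# Disable the fast cumsum kernel path. Set the variable before importing
# tutel so it is already visible when the extension loads.
os.environ['FAST_CUMSUM'] = '0'

import tutel  # noqa: E402  -- imported after setting the environment variable
```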
I don't suggest installing the CUDA toolkit from the default Ubuntu repository, as those packages are too old. You should follow the instructions here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu After the CUDA SDK is successfully installed, please purge...
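After installing from NVIDIA's repository, a quick sanity check (a sketch; the `nvcc` and `CUDA_HOME` locations depend on how the SDK was installed) is to confirm which toolkit PyTorch actually sees:

```py
import subprocess
import torch
from torch.utils.cpp_extension import CUDA_HOME

# CUDA runtime version this PyTorch build was compiled against
print('torch.version.cuda =', torch.version.cuda)

# Toolkit location that torch's C++/CUDA extension builder will use
print('CUDA_HOME =', CUDA_HOME)

# nvcc found on PATH; this should report the freshly installed SDK,
# not the old Ubuntu-repository one (raises FileNotFoundError if nvcc is missing)
print(subprocess.run(['nvcc', '--version'], capture_output=True, text=True).stdout)
```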
Hi, thanks for your info. According to the tracing, this is not a bug; rather, your code doesn't use it in the correct way: CUDA's evaluation from your code is based...
Hi, thanks for reporting this issue. For a low-equipped distributed environment (e.g. Ethernet with low-end bus bandwidth), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, as...
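For intuition, here is a rough back-of-envelope cost model of the per-layer All2All time; all numbers below are hypothetical placeholders, not measurements from your setup:

```py
# Rough All2All cost model: each rank exchanges (world_size - 1) / world_size
# of its dispatched tokens with the other ranks on every MoE layer.
tokens_per_rank = 4096          # hypothetical number of tokens per GPU
model_dim = 2048                # hypothetical hidden size
bytes_per_elem = 2              # fp16
world_size = 16

payload_bytes = tokens_per_rank * model_dim * bytes_per_elem
cross_rank_bytes = payload_bytes * (world_size - 1) / world_size

# Two very different effective bus bandwidths, in GB/s
for name, busbw_gbps in [('intra-node NVLink', 200.0), ('low-end Ethernet', 1.0)]:
    # 2 All2All calls per MoE layer: dispatch + combine
    t_ms = 2 * cross_rank_bytes / (busbw_gbps * 1e9) * 1e3
    print(f'{name}: ~{t_ms:.2f} ms of All2All per MoE layer')
```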
This is an improper environment configuration that is not recognized by PyTorch. Can you make a copy of nccl.h in /usr/include, and a copy of libnccl.so in /usr/lib/x86_64-linux-gnu? (A symlink is also fine.)
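A small check for whether the files ended up where the build tooling usually looks on Ubuntu (a sketch; adjust the paths for your distribution):

```py
import os

# Locations typically searched on Ubuntu; either real copies or symlinks
# at these paths are fine.
expected = [
    '/usr/include/nccl.h',
    '/usr/lib/x86_64-linux-gnu/libnccl.so',
]

for path in expected:
    status = 'found' if os.path.exists(path) else 'MISSING'
    print(f'{status}: {path}')
```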
Do you mean Megatron and DeepSpeed individually, or all of them working together?
Yes, Tutel is just an MoE layer implementation which is pluggable into any distributed framework. The way for another framework to use the Tutel MoE layer is by passing the distributed processing...
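A minimal sketch of what passing the process group could look like, assuming a Tutel version whose `moe_layer` constructor accepts a `group=` argument and the README-style `gate_type`/`experts` dictionaries; argument names vary between releases, so treat this as illustrative rather than the exact API:

```py
import torch
import torch.distributed as dist
from tutel import moe as tutel_moe

# The host framework (Megatron, DeepSpeed, ...) usually initializes this.
dist.init_process_group('nccl')
group = dist.new_group(ranks=list(range(dist.get_world_size())))  # or any subgroup

# Hypothetical configuration values; check the signature of your installed version.
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 4096},
    group=group,                    # process group supplied by the framework
).cuda()

x = torch.randn(8, 512, 1024, device='cuda')
y = moe(x)
```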
Can you explain why "x == y" for `y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)`?
Can you set `gate_noise = 0` for both and check if they produce the same results?
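One way to carry out both checks above, using hypothetical handles `moe_a` / `moe_b` for the two solutions, each built with `gate_noise = 0` so the routing is deterministic:

```py
import torch

def compare_outputs(moe_a, moe_b, model_dim=1024):
    """Run both MoE solutions on the same input and compare the results.

    moe_a / moe_b are hypothetical handles to the two implementations,
    both assumed to be configured with gate_noise = 0.
    """
    torch.manual_seed(0)
    x = torch.randn(8, 512, model_dim, device='cuda', dtype=torch.float16)
    with torch.no_grad():
        y_a, y_b = moe_a(x), moe_b(x)
    print('bitwise equal:', torch.equal(y_a, y_b))
    print('allclose     :', torch.allclose(y_a, y_b, rtol=1e-3, atol=1e-3))
    print('max abs diff :', (y_a - y_b).abs().max().item())
```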
OK, can you help provide these things? For both solutions, please add the following code after `y = fast_encode(..)`:

In the example code:
```py
...
torch.save([x, crit, y], 'test_cast_example.py')
```
...
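Once both runs have saved their tensors, they can be reloaded offline and compared; a sketch, where the second filename is a hypothetical placeholder for whatever the other run saves:

```py
import torch

# Load the artifacts saved by the snippets above (map to CPU so this can be
# inspected on any machine).
x_ref, crit_ref, y_ref = torch.load('test_cast_example.py', map_location='cpu')
x_new, crit_new, y_new = torch.load('test_case_other.pt', map_location='cpu')  # hypothetical name

print('x identical:', torch.equal(x_ref, x_new))
print('y allclose :', torch.allclose(y_ref.float(), y_new.float(), rtol=1e-3, atol=1e-3))
print('max |dy|   :', (y_ref.float() - y_new.float()).abs().max().item())
```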