Less Wright
Hi @leeeizhang, Re: -1 for loss - when using Pipeline Parallel, only the last stage computes an actual loss. Thus, you will need to adjust your logging filter to include...
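For illustration, here is a minimal sketch of such a filter; the function and parameter names are assumptions, not the actual logging code:

~~~python
import torch

# Minimal sketch: only report the loss from the last pipeline stage,
# since earlier stages never compute one (often surfaced as -1).
def maybe_log_loss(loss, pp_rank: int, pp_degree: int, step: int) -> None:
    is_last_stage = pp_rank == pp_degree - 1
    if is_last_stage and loss is not None:
        print(f"step {step}: loss = {loss.item():.4f}")

maybe_log_loss(torch.tensor(2.5), pp_rank=1, pp_degree=2, step=10)  # logged
maybe_log_loss(None, pp_rank=0, pp_degree=2, step=10)               # skipped
~~~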
Re: memory keeps increasing: Based on the screenshot, only two iterations are shown, iter 1 and iter 10. Normally, memory will always increase from iter 1 to iter 2,...
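If it helps to verify, a small sketch for logging per-iteration CUDA memory (these are standard torch.cuda APIs; the call site is assumed):

~~~python
import torch

# Log allocated and peak CUDA memory each iteration. A jump from iter 1
# to iter 2 is expected (e.g. optimizer states are materialized on the
# first optimizer.step()); after that, memory should plateau.
def log_cuda_memory(step: int, device: int = 0) -> None:
    alloc_gib = torch.cuda.memory_allocated(device) / 2**30
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"iter {step}: allocated={alloc_gib:.2f} GiB, peak={peak_gib:.2f} GiB")
~~~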
Closing, as this has all been ported forward to the new built-in native parallel approach.
Installing triton directly first simply results in the script uninstalling triton:
~~~
Found existing installation: triton 2.3.1
Uninstalling triton-2.3.1:
  Successfully uninstalled triton-2.3.1
~~~
and then failing out with the same...
Note that starting in a clean environment succeeds. Not sure if you want to just chalk it up to a complicated environment as the reason triton was not patched, or try to make patch_triton.py more robust.
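If robustifying it is the route, one possible hardening sketch (purely hypothetical; the actual logic in patch_triton.py may differ) would be to check the installed version before touching the environment:

~~~python
import importlib.metadata

# Hypothetical guard: only trigger the uninstall/reinstall cycle when the
# installed triton does not match what the patch expects, instead of
# unconditionally uninstalling whatever is already present.
def triton_needs_patch(required: str = "2.3.1") -> bool:
    try:
        installed = importlib.metadata.version("triton")
    except importlib.metadata.PackageNotFoundError:
        return True  # no triton installed; proceed with the patched install
    return installed != required  # reinstall only on a version mismatch
~~~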
@ngimel - thanks for the info above! To your questions: 1 - "This optimizes performance for an extremely common function, and as such should go into pytorch core and not into...
Hi @EugenHotaj - yes, a full DS v3 training script is coming. The current PRs are part of an iterative process... more is coming soon!
Hi @janEbert - nice, thanks for the update, and yes, I would definitely be happy to get an external PR from you on this. I have been busy with groupGEMM and...
Just wanted to add that if we do want to support this, then longer term it may be a lot more performant to repurpose Ke's CUDA kernel as the nan...
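For context, the naive alternative looks something like the sketch below (illustrative only, not Ke's kernel): every tensor gets its own reduction kernel plus a host sync, which is why a single fused CUDA kernel should win when scanning many gradients.

~~~python
import torch

# Naive NaN scan for comparison: one isnan + reduction kernel per tensor,
# and .item() forces a device-to-host sync on every call.
def any_nan(tensors: list[torch.Tensor]) -> bool:
    return any(torch.isnan(t).any().item() for t in tensors)
~~~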
Hi @lxww302 - that branch is working for tp2ep. I hit some issues re: dp2ep, so I would not use that yet, but tp2ep was working perfectly in my brief...