Less Wright
Hi @leeeizhang, Re: -1 for loss - when using Pipeline Parallel, only the last stage computes an actual loss. Thus, you will need to adjust your logging filter to include...
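For illustration, here is a minimal sketch of such a filter; the function and parameter names are assumptions, not the actual logging code:

~~~python
import torch

# Minimal sketch: only report the loss from the last pipeline stage,
# since earlier stages never compute one (often surfaced as -1).
def maybe_log_loss(loss, pp_rank: int, pp_degree: int, step: int) -> None:
    is_last_stage = pp_rank == pp_degree - 1
    if is_last_stage and loss is not None:
        print(f"step {step}: loss = {loss.item():.4f}")

maybe_log_loss(torch.tensor(2.5), pp_rank=1, pp_degree=2, step=10)  # logged
maybe_log_loss(None, pp_rank=0, pp_degree=2, step=10)               # skipped
~~~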
Re: memory keeps increasing: Based on the screenshot, only two iterations are shown, iter 1 and iter 10. Normally, memory will always increase from iter 1 to iter 2,...
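If it helps to verify, a small sketch for logging per-iteration CUDA memory (these are standard torch.cuda APIs; the call site is assumed):

~~~python
import torch

# Log allocated and peak CUDA memory each iteration. A jump from iter 1
# to iter 2 is expected (e.g. optimizer states are materialized on the
# first optimizer.step()); after that, memory should plateau.
def log_cuda_memory(step: int, device: int = 0) -> None:
    alloc_gib = torch.cuda.memory_allocated(device) / 2**30
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"iter {step}: allocated={alloc_gib:.2f} GiB, peak={peak_gib:.2f} GiB")
~~~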
Closing, as this has all been ported forward to the new built-in native parallel approach.
Installing triton directly first simply results in the script uninstalling triton:
~~~
Found existing installation: triton 2.3.1
Uninstalling triton-2.3.1:
  Successfully uninstalled triton-2.3.1
~~~
and then failing out with the same...
Note that starting in a clean environment succeeds. Not sure if you want to just chalk it up to a complicated environment as the reason triton was not patched, or try to make patch_triton.py more robust.
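If robustifying it is the route, one possible hardening sketch (purely hypothetical; the actual logic in patch_triton.py may differ) would be to check the installed version before touching the environment:

~~~python
import importlib.metadata

# Hypothetical guard: only trigger the uninstall/reinstall cycle when the
# installed triton does not match what the patch expects, instead of
# unconditionally uninstalling whatever is already present.
def triton_needs_patch(required: str = "2.3.1") -> bool:
    try:
        installed = importlib.metadata.version("triton")
    except importlib.metadata.PackageNotFoundError:
        return True  # no triton installed; proceed with the patched install
    return installed != required  # reinstall only on a version mismatch
~~~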
@ngimel - thanks for the info above! To your questions: 1 - "This optimizes performance for an extremely common function, and as such should go into pytorch core and not into...
Hi @EugenHotaj - yes, a full DS v3 training script is coming. The current PRs are part of an iterative process... more is coming soon!
Hi @janEbert - nice, thanks for the update, and yes, I would definitely be happy to get an external PR from you on this. I have been busy with groupGEMM and...
Just wanted to add that if we do want to support this, then longer term it may be a lot more performant to repurpose Ke's CUDA kernel as the nan...
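For context, the naive alternative looks something like the sketch below (illustrative only, not Ke's kernel): every tensor gets its own reduction kernel plus a host sync, which is why a single fused CUDA kernel should win when scanning many gradients.

~~~python
import torch

# Naive NaN scan for comparison: one isnan + reduction kernel per tensor,
# and .item() forces a device-to-host sync on every call.
def any_nan(tensors: list[torch.Tensor]) -> bool:
    return any(torch.isnan(t).any().item() for t in tensors)
~~~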
Hi @lxww302 - that branch is working for tp2ep. I hit some issues re: dp2ep, so I would not use that yet, but tp2ep was working perfectly in my brief...