veScale
A PyTorch Native LLM Training Framework
I read the MegaScale paper and found that the multi-node trace profiler would be really useful for me. I'd like to know how and where to use this...
I'm using the ndtimeline tool and finding that the times for forward-compute and backward-compute are inaccurate. For the main0 stream of rank0, the compute time for both forward-compute and backward-compute appears...
In the README of ndtimeline, you mention implementing interfaces to obtain the streams used for NCCL communication, specifically `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`. However, these interfaces do not seem to be present in the patches directory...
1. Add an NCCL stream fetch API in the PyTorch patches. 2. Add dependency version limits for numpy and pytest in the torch_patch and veScale requirements.
# TL;DR # Motivation Our current APIs for nD parallel training are low-level and rather complex for common users ... Ideally, we want a simpler API at...
Hi, I'm interested in the Collective Communication Group Initialization part of the paper, which greatly reduced the initialization time of a training task (from 1047 s to under 5 s): ...
Does ndtimeline support multi-machine, multi-GPU use? Currently I can use the ndtimeline tool on a single machine with multiple GPUs to analyze GPT, but I wonder whether it supports multiple machines, and how...
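For the multi-machine question above, a typical PyTorch multi-node launch goes through `torchrun`, with each node running the same command; per-rank timeline files can then be collected afterwards. A minimal sketch (node hostname, port, and script name are assumptions, not from this repo):

```shell
# Hypothetical two-node, 8-GPU-per-node launch; run the same command on both nodes.
# node0:29500 is an assumed rendezvous endpoint; train_gpt.py is a placeholder script.
torchrun --nnodes=2 --nproc-per-node=8 \
  --rdzv-backend=c10d --rdzv-endpoint=node0:29500 \
  train_gpt.py
```

Whether ndtimeline itself merges the per-node traces is exactly the open question in the issue; the launch command only ensures every rank participates in the same job.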
# Single-Device-Abstract DDP ## Motivation In current PyTorch DDP, when training a model that contains Dropout operations, the final results obtained from distributed training will not be consistent with those obtained...
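The Dropout inconsistency described above comes from each DDP rank holding its own RNG state, so ranks sample different dropout masks than a single device would. A minimal stdlib-only sketch of that divergence (the `dropout_mask` helper and per-rank seeds are hypothetical stand-ins for each worker's CUDA RNG state, not veScale code):

```python
import random

def dropout_mask(seed, n=8, p=0.5):
    # Simulate a Dropout mask drawn from a rank-local RNG; 0 means "dropped".
    rng = random.Random(seed)
    return [0 if rng.random() < p else 1 for _ in range(n)]

# With per-rank RNG states (as in plain DDP), the masks diverge across ranks,
# so the distributed result cannot match single-device training.
rank0, rank1 = dropout_mask(seed=0), dropout_mask(seed=1)
print(rank0 != rank1)  # True: the ranks drop different elements

# Sharing one RNG state across ranks (the single-device abstraction) makes
# every rank apply the identical mask, matching single-device results.
print(dropout_mask(seed=42) == dropout_mask(seed=42))  # True
```

This is why a "single-device abstract" DDP needs to synchronize (or broadcast) the RNG state for randomized ops, not just the gradients.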
When I run run_open_llama_w_vescale.py with torch version 2.5.1+cu124, I get the following error: [rank4]: Traceback (most recent call last): [rank4]: File "/code/veScale/examples/open_llama_4D_benchmark/run_open_llama_w_vescale-ljx.py", line 104, in [rank4]: vescale_model = parallelize_module(model, device_mesh["TP"],...
Is there a prebuilt image available? Following the image build steps in the quick start, the build keeps failing.