veScale

A PyTorch Native LLM Training Framework

Results: 17 veScale issues

I read the MegaScale paper and found the multi-node trace profiler really useful. I would like to know how and where to use this...

I am using the ndtimeline tool and finding that the times for forward-compute and backward-compute are inaccurate. For the main0 stream of rank0, the compute time for both forward-compute and backward-compute appears...

In the ndtimeline README, you mention implementing interfaces to obtain the streams used for NCCL communication, specifically `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`. However, these interfaces do not seem to be present in the patches directory....

enhancement
question

1. Add an NCCL stream fetch API in the PyTorch patches
2. Add dependency version limits for numpy and pytest in the torch_patch and vescale requirements

# TL;DR ![tldr](https://github.com/volcengine/veScale/assets/16678974/4619d84a-e0a6-46fa-9dea-f7c6509ae496) # Motivation Our current APIs for nD Parallel Training are low-level and somewhat complex for common users ... Ideally, we want a simpler API at...

rfc

Hi, I'm interested in the Collective Communication Group Initialization part of the paper, which greatly reduced the initialization time of a training task (from 1047s to under 5s): ![image](https://github.com/volcengine/veScale/assets/173707402/f5b024e3-dea4-49a5-9e72-ce8e80193d9a)...

question

Does ndtimeline support multi-machine, multi-GPU use? Currently I can analyze GPT with the ndtimeline tool on a single machine with multiple GPUs, but I wonder whether it also supports multiple machines. How...

# Single-Device-Abstract DDP ## Motivation In current PyTorch DDP, when training a model with Dropout operations, the final results obtained from distributed training will not be consistent with those obtained...

rfc

When I run run_open_llama_w_vescale.py with torch version 2.5.1+cu124, I encounter the following error:
[rank4]: Traceback (most recent call last):
[rank4]: File "/code/veScale/examples/open_llama_4D_benchmark/run_open_llama_w_vescale-ljx.py", line 104, in
[rank4]: vescale_model = parallelize_module(model, device_mesh["TP"],...

Is there a prebuilt image available? I followed the image build steps in the quick start, but the build keeps failing.