veScale
A PyTorch Native LLM Training Framework
I read the MegaScale paper and found that the multi-node trace profiler would be really useful for me. I'd like to know how and where to use this...
I'm using the ndtimeline tool and finding that the times for forward-compute and backward-compute are inaccurate. For the main0 stream of rank0, the compute time for both forward-compute and backward-compute appears...
In the README of ndtimeline, you mention implementing interfaces to obtain the streams used for NCCL communication, specifically `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`. However, these interfaces do not seem to be present in the patches directory...
1. Add an NCCL stream fetch API in the PyTorch patches. 2. Add dependency version limits for numpy and pytest in the torch_patch and veScale requirements.
# TL;DR # Motivation Our current APIs for nD parallel training are low-level and rather complex for common users ... Ideally, we want a simpler API at...
Hi, I'm interested in the Collective Communication Group Initialization part of the paper, which greatly reduced the initialization time of a training task (from 1047 s to under 5 s): ...
Does ndtimeline support multi-machine, multi-GPU use? Currently I can use the ndtimeline tool on a single machine with multiple GPUs to analyze GPT, but I wonder whether it supports multiple machines, and how...
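For the multi-machine question above, a typical PyTorch multi-node launch goes through `torchrun`, with each node running the same command; per-rank timeline files can then be collected afterwards. A minimal sketch (node hostname, port, and script name are assumptions, not from this repo):

```shell
# Hypothetical two-node, 8-GPU-per-node launch; run the same command on both nodes.
# node0:29500 is an assumed rendezvous endpoint; train_gpt.py is a placeholder script.
torchrun --nnodes=2 --nproc-per-node=8 \
  --rdzv-backend=c10d --rdzv-endpoint=node0:29500 \
  train_gpt.py
```

Whether ndtimeline itself merges the per-node traces is exactly the open question in the issue; the launch command only ensures every rank participates in the same job.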
# Single-Device-Abstract DDP ## Motivation In current PyTorch DDP, when training a model that contains Dropout operations, the final results obtained from distributed training will not be consistent with those obtained...
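The Dropout inconsistency described above comes from each DDP rank holding its own RNG state, so ranks sample different dropout masks than a single device would. A minimal stdlib-only sketch of that divergence (the `dropout_mask` helper and per-rank seeds are hypothetical stand-ins for each worker's CUDA RNG state, not veScale code):

```python
import random

def dropout_mask(seed, n=8, p=0.5):
    # Simulate a Dropout mask drawn from a rank-local RNG; 0 means "dropped".
    rng = random.Random(seed)
    return [0 if rng.random() < p else 1 for _ in range(n)]

# With per-rank RNG states (as in plain DDP), the masks diverge across ranks,
# so the distributed result cannot match single-device training.
rank0, rank1 = dropout_mask(seed=0), dropout_mask(seed=1)
print(rank0 != rank1)  # True: the ranks drop different elements

# Sharing one RNG state across ranks (the single-device abstraction) makes
# every rank apply the identical mask, matching single-device results.
print(dropout_mask(seed=42) == dropout_mask(seed=42))  # True
```

This is why a "single-device abstract" DDP needs to synchronize (or broadcast) the RNG state for randomized ops, not just the gradients.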
When I run run_open_llama_w_vescale.py with torch version 2.5.1+cu124, I get the following error: [rank4]: Traceback (most recent call last): [rank4]: File "/code/veScale/examples/open_llama_4D_benchmark/run_open_llama_w_vescale-ljx.py", line 104, in [rank4]: vescale_model = parallelize_module(model, device_mesh["TP"],...
Is there a prebuilt image available? Following the image build steps in the quick start, the build keeps failing.