The times for forward-compute and backward-compute captured by the ndtimeline-tool are inaccurate

Open zmtttt opened this issue 1 year ago • 4 comments

I'm using the ndtimeline tool and found that the captured times for forward-compute and backward-compute are inaccurate. For the main0 stream of rank0, the compute time for both forward-compute and backward-compute appears to decrease over time. This trend is unexpected and suggests that there may be an issue with either the captured timing data or the underlying computation itself.

The test is in /veScale/test/parallel/pipeline/instruction and is run as python test_schedule.py -k test_runtime_engine_with_profiling with batches=128, using FourMLP as the model. @Meteorix @MingjiHan99 @pengyanghua @MackZackA

zmtttt avatar Aug 20 '24 06:08 zmtttt

To clarify: I have made some improvements to the script, but there are still problems with the captured timing data for rank0 and rank2. For example, the forward-propagation and backward-propagation times for ranks 0 and 2 during the first few training iterations are inconsistent with those of later iterations.

zmtttt avatar Aug 21 '24 01:08 zmtttt

Thank you @zmtttt for your interest in veScale! If I understand your question correctly, you are concerned about the visible gaps in profiled overheads between the first few iterations and their later counterparts. Scenarios like this are expected. At the start of training, factors including dynamic cache allocation and the warm-up of third-party operators contribute to larger overheads in earlier iterations. Effective communication bandwidth that varies over time can also lead to different readings. A common practice for obtaining consistent readings is to gather data after 5-10 iterations, once compute and communication times have stabilized.
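
To make that concrete, here is a minimal post-processing sketch (plain Python, not part of veScale's API; the warmup count and the example timings below are made up for illustration) that discards the first few warm-up iterations before summarizing per-step timings:

from statistics import mean, stdev

def summarize_step_times(step_times_ms, warmup=10):
    # Drop the first `warmup` iterations, where cache allocation and operator
    # warm-up inflate the measured forward/backward compute times.
    stable = step_times_ms[warmup:]
    return {
        "mean_ms": mean(stable),
        "stdev_ms": stdev(stable) if len(stable) > 1 else 0.0,
        "steps_used": len(stable),
    }

# Made-up forward-compute times (ms): high at first, then settling down.
times = [42.0, 31.5, 25.1, 21.0, 19.8, 19.7, 19.9, 19.8, 19.7, 19.8, 19.7, 19.6, 19.8]
print(summarize_step_times(times, warmup=10))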

MackZackA avatar Aug 21 '24 05:08 MackZackA

Thank you for your detailed explanation and insights into veScale! You're right that it's typical to see variations in the overheads during the initial iterations of training. Factors such as cache warm-up, initialization of third-party operators, and other system-level activities can indeed contribute to these discrepancies.

My concern, however, is that it seems to take a significant number of iterations before the forward and backward computations stabilize. This is different from what I've experienced in other scenarios where the overheads typically stabilize after just a few iterations (usually around 5-10). In my case, the stabilization point appears to be much later, which raises some questions about the efficiency of the setup.

Moreover, I've observed that the issues vary significantly across different runs of the same distributed training code. When analyzing the trace files generated after each run, the problematic areas seem to shift or change, making it challenging to identify a consistent pattern or source of inefficiency. Some runs stabilize after only 5-10 iterations, while others require more than 50 iterations to stabilize. Even more puzzling, some runs appear stable initially but then become unstable later on. This variability makes it difficult to optimize the training process.

Like this: [timeline screenshot omitted]

zmtttt avatar Aug 21 '24 06:08 zmtttt

Hi! (1) I've set up my experiment again with step=30 and batch=128 and visualized the JSON files generated at steps 2, 20, and 30. (2) I noticed that as the step count increases, the 1F1B (one forward pass followed by one backward pass) scheduling pattern is generally satisfied. However, the forward and backward computations still show some instability: step 2 looks correct, but step 20 and step 30 look incorrect.

Why might this be happening? I wonder whether this is an issue with the timeline capturing or a problem with my own code/setup.

[Timeline screenshots for step 2, step 20, and step 30 omitted.]
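
For reference, the steady-state ordering that the SIMPLE_1F1B schedule is expected to produce per stage can be sketched as below (a schematic only, not veScale's scheduler implementation): each stage runs a few warm-up forwards, then alternates one forward with one backward, and finishes with the remaining backwards.

def simple_1f1b_order(stage_id, num_stages, num_microbatches):
    # Schematic 1F1B ordering for one pipeline stage (illustrative, not veScale code).
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    order = [("F", i) for i in range(warmup)]                 # warm-up forwards
    for i in range(num_microbatches - warmup):
        order.append(("F", warmup + i))                       # one forward ...
        order.append(("B", i))                                # ... then one backward
    for i in range(num_microbatches - warmup, num_microbatches):
        order.append(("B", i))                                # cool-down backwards
    return order

# Example: 4 pipeline stages, 8 microbatches; stage 0 has the longest warm-up phase.
print(simple_1f1b_order(stage_id=0, num_stages=4, num_microbatches=8))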

zmtttt avatar Aug 22 '24 03:08 zmtttt

Do you mind providing the source code and data to help us figure out what happened? We haven't been able to reproduce this phenomenon. A possible reason is that timer metrics are flushed asynchronously, which adds some overhead at one or more steps after a flush and may slow down CUDA kernel launches because of the GIL.
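
As a rough way to check that hypothesis, one could time host-side kernel launches with and without a busy Python thread running in the background; the sketch below is illustrative only (it is not veScale code, assumes a CUDA device, and uses a plain busy loop as a stand-in for an asynchronous metrics flush):

import threading
import time
import torch

def gil_hog(stop_event):
    # Stand-in for a background thread doing Python-level work (e.g. serializing
    # and flushing timer metrics); it competes with the main thread for the GIL.
    x = 0
    while not stop_event.is_set():
        x = (x + 1) % 1000003

def avg_launch_time(n=2000):
    a = torch.randn(256, 256, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        a = a @ a  # small ops, so host-side launch overhead matters in the loop
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n

baseline = avg_launch_time()
stop = threading.Event()
worker = threading.Thread(target=gil_hog, args=(stop,), daemon=True)
worker.start()
contended = avg_launch_time()
stop.set()
worker.join()
print(f"per-iteration time: {baseline * 1e6:.1f} us baseline, {contended * 1e6:.1f} us with background thread")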

vocaltract avatar Aug 30 '24 06:08 vocaltract

OK, thanks for your answer! This is my code:

@with_comms
def test_runtime_engine_with_profiling(self):
    print("test_runtime_engine_with_profiling")
    """
    Tests runtime engine with distributed nD timeline profiling.
    """
    # initialize global device mesh-3d
    VESCALE_DEVICE_MESH.init_device_mesh(
        device_type="cuda",
        mesh_shape=(4, 1, 1),
        mesh_dim_names=("PP", "DP", "TP"),
    )
    global local_rank
    local_rank = self.rank
    device = f"cuda:{local_rank}"
    # must do this: https://pytorch.org/docs/stable/distributed.html
    torch.cuda.set_device(device)
    os.environ["LOCAL_RANK"] = str(local_rank)

    from vescale.ndtimeline import init_ndtimers, flush, wait
    init_ndtimers(rank=int(local_rank), local=int(local_rank), enable_streamer=True)
    
    # FourMLP model
    n_hidden = 3
    batches = 128
    model = FourMLP(n_hidden)
    all_batches_out = []

  
    fwd_plan = {
        ".input": [[Replicate()]],#
        ".output": [[Replicate()]], # 
    }

    model_list = []

    tp_mesh = VESCALE_DEVICE_MESH.get_tensor_parallel_mesh()
    if local_rank == 0:
        model.mlp1 = parallelize_module(model.mlp1, tp_mesh, {"parameter": None, "forward": fwd_plan})  # "parameter" is the plan for parallelizing parameters
        model_list = [model.mlp1]
    elif self.rank == 1:
        model.mlp2 = parallelize_module(model.mlp2, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp2]
    elif self.rank == 2:
        model.mlp3 = parallelize_module(model.mlp3, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp3]
    elif self.rank == 3:
        model.mlp4 = parallelize_module(model.mlp4, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp4]


    deps = get_linear_pp_module_dep2(model_list, VESCALE_DEVICE_MESH.get_global_tensor_parallel_meshes())
    data_iterator = []
    for i in range(batches):
        data = torch.zeros(1, 1, n_hidden) + i 
        data_iterator.append(distribute_tensor(data.float(), tp_mesh, placements=[Replicate()]))
    pipe_engine = ScheduleEngine(
        deps=deps,
        meshes=VESCALE_DEVICE_MESH.get_global_tensor_parallel_meshes(),
        schedule=PipelineScheduleType.SIMPLE_1F1B,
        batches=batches,
        data_iterator=data_iterator,
        stage_id=local_rank,
        shape=(1, 1, 3),
        dtype=torch.float32,
    )

    n_steps = 30
    for step in range(n_steps):
        print(f"Step {step + 1}:")
        if local_rank == 3:
            model.zero_grad(set_to_none=True)

        _, all_forward = ScheduleEngine.execute(pipe_engine)
        if local_rank == 3:
            loss_per_microbatch = [item[1] for item in all_forward]
            for t1, t2 in zip(loss_per_microbatch, all_batches_out):
                pass
            all_batches_out = loss_per_microbatch
        flush()
        wait()
    if local_rank == 3:
        print("nD timeline analysis for step 10 completed.")

zmtttt avatar Aug 30 '24 06:08 zmtttt

There are two misuses of ndtimeline:

  1. Wrong argument local:
init_ndtimers(rank=int(local_rank), local=int(local_rank), enable_streamer=True)
  2. Unnecessary call of wait():
    for step in range(n_steps):
        print(f"Step {step + 1}:")
        if local_rank == 3:
            model.zero_grad(set_to_none=True)
        _, all_forward = ScheduleEngine.execute(pipe_engine)
        if local_rank == 3:
            loss_per_microbatch = [item[1] for item in all_forward]
            for t1, t2 in zip(loss_per_microbatch, all_batches_out):
                pass
            all_batches_out = loss_per_microbatch
        flush()
        wait()

I think the misuse of wait() is the root cause of this phenomenon. You should move wait() to the end of the file, so that it is called only once.
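
Following that suggestion, the profiling-related part of the loop would look roughly like this (a sketch based on the code above; flush() stays per step as in the original, while wait() moves after the loop and runs exactly once):

    for step in range(n_steps):
        if local_rank == 3:
            model.zero_grad(set_to_none=True)
        _, all_forward = ScheduleEngine.execute(pipe_engine)
        flush()  # per-step flush of timer metrics, as in the original code
    wait()       # called once, after all steps, as suggested above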

vocaltract avatar Aug 30 '24 07:08 vocaltract

Thanks! But init_ndtimers(rank=int(local_rank), local=int(local_rank), enable_streamer=True) is the original code in the test file...

zmtttt avatar Aug 30 '24 07:08 zmtttt

I will fix this typo in the test file later.

vocaltract avatar Aug 30 '24 07:08 vocaltract

OK, thanks!

zmtttt avatar Aug 30 '24 08:08 zmtttt

Hello, @zmtttt

I'm new to vescale, and I see you have experience running training codes with 2D or higher dimensions.

I'm facing a lot of difficulties with PP+DP training—would you be willing to share any of your experiences?

Any comments or insights would be really helpful.

Best regards, Taekyoung

taekyounghan avatar Jan 20 '25 11:01 taekyounghan