The times for forward-compute and backward-compute captured by the ndtimeline tool are inaccurate
I'm using the ndtimeline tool and finding that the times for forward-compute and backward-compute are inaccurate. For the main0 stream of rank0, the compute time for both forward-compute and backward-compute appears to decrease over time. This trend is unexpected and suggests that there may be an issue with the captured timing data or with the underlying computation itself.
The test is: /veScale/test/parallel/pipeline/instruction# python test_schedule.py -k test_runtime_engine_with_profiling
batches=128, using FourMLP as the model under test
@Meteorix @MingjiHan99 @pengyanghua @MackZackA
I have made improvements to the script, but there are still problems with the captured timing data for rank0 and rank2.
For example, the forward and backward propagation times for ranks 0 and 2 during the first few iterations of training are inconsistent with those in later iterations.
Thank you @zmtttt for your interest in veScale! If I understand your question correctly, you are concerned about the visible gaps in profiled overheads between the first few iterations and their later counterparts. Scenarios like this are expected. At the start of training, factors such as dynamic cache allocation and warm-up of third-party operators contribute to larger overheads in earlier iterations. Effective communication bandwidth that varies over time can also result in different readings. A common practice for obtaining consistent readings is to gather data after 5-10 iterations, once compute and communication times have become stable.
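For illustration, here is a minimal sketch of that practice in Python; the per-step durations and the WARMUP_STEPS constant are hypothetical placeholders, and in practice the numbers would come from the trace/JSON files exported by the profiler.

# Minimal sketch: average compute time while discarding warm-up steps.
# The per-step durations below are hypothetical placeholders; in practice
# they would be extracted from the exported ndtimeline trace/JSON files.
from statistics import mean

WARMUP_STEPS = 10  # discard the first 5-10 iterations before reading timings

per_step_forward_ms = [42.0, 31.5, 25.8, 21.0, 19.7, 18.9,
                       18.8, 18.7, 18.8, 18.7, 18.8, 18.7]

steady_state = per_step_forward_ms[WARMUP_STEPS:]
if steady_state:
    print(f"steady-state forward compute: {mean(steady_state):.2f} ms "
          f"(ignoring first {WARMUP_STEPS} warm-up steps)")
else:
    print("not enough steps recorded to skip the warm-up window")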
Thank you for your detailed explanation and insights into veScale! You're right that it's typical to see variations in the overheads during the initial iterations of training. Factors such as cache warm-up, initialization of third-party operators, and other system-level activities can indeed contribute to these discrepancies.
My concern, however, is that it seems to take a significant number of iterations before the forward and backward computations stabilize. This is different from what I've experienced in other scenarios where the overheads typically stabilize after just a few iterations (usually around 5-10). In my case, the stabilization point appears to be much later, which raises some questions about the efficiency of the setup.
Moreover, I've observed that the issues vary significantly across different runs of the same distributed training code. When analyzing the trace files generated after each run, the problematic areas seem to shift or change, making it challenging to identify a consistent pattern or source of inefficiency. Some runs stabilize after only 5-10 iterations, while others require more than 50 iterations to stabilize. Even more puzzling, some runs appear stable initially but then become unstable later on. This variability makes it difficult to optimize the training process.
Like this:
Hi! (1) I've set up my experiment again with step=30 and batch=128 and visualized the JSON files generated at steps 2, 20, and 30. (2) I noticed that as the step count increases, the 1F1B (one forward pass followed by one backward pass) scheduling pattern is generally satisfied. However, there is still some instability in the forward and backward computations: step 2 looks correct, but steps 20 and 30 look incorrect.
Why might this be happening? I wonder whether this is an issue with the timeline capturing or a problem with my own code/setup.
Step = 2;
Step = 20;
Step = 30;
Do you mind providing the source code and data to help us figure out what happened? We haven't been able to reproduce this phenomenon.
A possible reason is that timer metrics are flushed asynchronously, which adds overhead at one or more steps after the flush and may slow down CUDA kernel launches due to the GIL.
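To illustrate the GIL effect in isolation, here is a small, self-contained sketch (not veScale code): a CPU-bound background thread, standing in for an asynchronous metrics flush, measurably delays short operations issued from the main thread, which is the same mechanism that could delay CUDA kernel launches.

# Standalone illustration (not veScale code): a CPU-bound background thread,
# standing in for an asynchronous metrics flush, competes for the GIL and
# delays short main-thread operations such as kernel launches.
import threading
import time

def busy_flush(duration_s: float) -> None:
    # Pure-Python busy loop: competes for the GIL with the main thread.
    end = time.perf_counter() + duration_s
    counter = 0
    while time.perf_counter() < end:
        counter += 1

def time_host_calls(n: int = 200_000) -> float:
    # Stand-in for many tiny host-side operations.
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i
    return time.perf_counter() - start

baseline = time_host_calls()

flusher = threading.Thread(target=busy_flush, args=(2.0,))
flusher.start()
contended = time_host_calls()
flusher.join()

print(f"baseline:  {baseline * 1e3:.2f} ms")
print(f"contended: {contended * 1e3:.2f} ms (background 'flush' thread holding the GIL)")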
OK, thanks for your answer!
This is my code:
# Excerpt from test_schedule.py; the module-level imports are omitted here.
@with_comms
def test_runtime_engine_with_profiling(self):
    """
    Tests the runtime engine with distributed nD timeline profiling.
    """
    print("test_runtime_engine_with_profiling")
    # Initialize the global 3-D device mesh.
    VESCALE_DEVICE_MESH.init_device_mesh(
        device_type="cuda",
        mesh_shape=(4, 1, 1),
        mesh_dim_names=("PP", "DP", "TP"),
    )
    global local_rank
    local_rank = self.rank
    device = f"cuda:{local_rank}"
    # must do this: https://pytorch.org/docs/stable/distributed.html
    torch.cuda.set_device(device)
    os.environ["LOCAL_RANK"] = str(local_rank)

    from vescale.ndtimeline import init_ndtimers, flush, wait

    init_ndtimers(rank=int(local_rank), local=int(local_rank), enable_streamer=True)

    # FourMLP model
    n_hidden = 3
    batches = 128
    model = FourMLP(n_hidden)
    all_batches_out = []
    fwd_plan = {
        ".input": [[Replicate()]],
        ".output": [[Replicate()]],
    }
    model_list = []
    tp_mesh = VESCALE_DEVICE_MESH.get_tensor_parallel_mesh()
    # Assign one MLP stage to each pipeline rank.
    if local_rank == 0:
        # "parameter" is the plan for parallelizing parameters (None here).
        model.mlp1 = parallelize_module(model.mlp1, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp1]
    elif self.rank == 1:
        model.mlp2 = parallelize_module(model.mlp2, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp2]
    elif self.rank == 2:
        model.mlp3 = parallelize_module(model.mlp3, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp3]
    elif self.rank == 3:
        model.mlp4 = parallelize_module(model.mlp4, tp_mesh, {"parameter": None, "forward": fwd_plan})
        model_list = [model.mlp4]
    deps = get_linear_pp_module_dep2(model_list, VESCALE_DEVICE_MESH.get_global_tensor_parallel_meshes())
    data_iterator = []
    for i in range(batches):
        data = torch.zeros(1, 1, n_hidden) + i
        data_iterator.append(distribute_tensor(data.float(), tp_mesh, placements=[Replicate()]))
    pipe_engine = ScheduleEngine(
        deps=deps,
        meshes=VESCALE_DEVICE_MESH.get_global_tensor_parallel_meshes(),
        schedule=PipelineScheduleType.SIMPLE_1F1B,
        batches=batches,
        data_iterator=data_iterator,
        stage_id=local_rank,
        shape=(1, 1, 3),
        dtype=torch.float32,
    )
    n_steps = 30
    for step in range(n_steps):
        print(f"Step {step + 1}:")
        if local_rank == 3:
            model.zero_grad(set_to_none=True)
        _, all_forward = ScheduleEngine.execute(pipe_engine)
        if local_rank == 3:
            loss_per_microbatch = [item[1] for item in all_forward]
            for t1, t2 in zip(loss_per_microbatch, all_batches_out):
                pass
            all_batches_out = loss_per_microbatch
        flush()
        wait()
    if local_rank == 3:
        print("nD timeline analysis for step 10 completed.")
There are two misuses of ndtimeline.
- Wrong argument local in
init_ndtimers(rank=int(local_rank), local=int(local_rank), enable_streamer=True)
- Unnecessary call of wait() inside the training loop:
for step in range(n_steps):
    print(f"Step {step + 1}:")
    if local_rank == 3:
        model.zero_grad(set_to_none=True)
    _, all_forward = ScheduleEngine.execute(pipe_engine)
    if local_rank == 3:
        loss_per_microbatch = [item[1] for item in all_forward]
        for t1, t2 in zip(loss_per_microbatch, all_batches_out):
            pass
        all_batches_out = loss_per_microbatch
    flush()
    wait()
I think the misuse of wait() is the root cause of this phenomenon.
You should move wait() to the end of the file; it should be called only once.
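For reference, here is a sketch of the adjusted loop, reusing the variables from the test posted above: flush() stays inside the loop, while wait() is called exactly once after the last step.

# Sketch of the suggested fix, based on the loop posted above:
# keep flush() per step, but call wait() only once after all steps.
for step in range(n_steps):
    print(f"Step {step + 1}:")
    if local_rank == 3:
        model.zero_grad(set_to_none=True)
    _, all_forward = ScheduleEngine.execute(pipe_engine)
    if local_rank == 3:
        loss_per_microbatch = [item[1] for item in all_forward]
        all_batches_out = loss_per_microbatch
    flush()  # hand this step's timer metrics to the asynchronous flusher

wait()  # block once, after the last step, until all metrics are written out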
Thanks! But init_ndtimers(rank=int(local_rank), local=int(local_rank), enable_streamer=True) is the original code...
I will fix this typo in the test file later.
OK, thanks!
Hello, @zmtttt
I'm new to vescale, and I see you have experience running training codes with 2D or higher dimensions.
I'm facing a lot of difficulties with PP+DP training—would you be willing to share any of your experiences?
Any comments or insights would be really helpful.
Best regards, Taekyoung