[QUESTION] How to Use ndtimeline in a Multi-Machine Multi-GPU Environment

Open zmtttt opened this issue 1 year ago • 4 comments

Does ndtimeline support multi-machine, multi-GPU use? I can currently use it on a single machine with multiple GPUs to analyze GPT, but I wonder whether it also supports multiple machines. How do I flush ndtimeline? How should communication be handled? And how do I define a custom time event? Thanks! The following picture is from a single machine with four GPUs running Megatron.

@MingjiHan99 @pengyanghua @MackZackA @JsBlueCat @Meteorix

zmtttt, Sep 18 '24 09:09

Hello! I wonder how to use ndtimeline across multiple machines. The documentation says: “In case you need a tracing file related to ranks on different machines, you can implement an MQHandler by yourself and send all metrics to a central storage. This provides you with a method to filter and generate the tracing file for specified ranks.” Does MQHandler mean message queue? What is the central storage, and how do I set it up? Have you evaluated the time consumption?

zmtttt, Sep 23 '24 03:09

Thank you for your interest in veScale! For this question, I would like to refer you to talk to @vocaltract who is an expert in MQ handler.

MackZackA, Oct 09 '24 14:10

“MQHandler” stands for “message queue handler”. We tend to use a message queue (MQ) to send metric data. The overhead is quite low because the message queue producer has its own local in-memory buffer and sends data to the broker asynchronously. “Central Storage” refers to the infrastructure that consumes messages and persists them in a data warehouse such as Hive, ClickHouse, InfluxDB, and so on.
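For illustration only, the buffering idea described above can be sketched in plain Python. This is not the ndtimeline API and not veScale code: `BufferedMetricSender`, its `publish` callback, and the metric fields are all hypothetical names. In a real setup, `publish` would presumably wrap the send call of whatever message queue client you use, and each rank would hold one sender.

```python
import json
import queue
import threading
from typing import Any, Callable, Dict


class BufferedMetricSender:
    """Hypothetical helper (not part of ndtimeline): buffers metric records
    in memory and hands them to a broker client on a background thread."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, publish: Callable[[bytes], None], max_buffered: int = 10000):
        self._publish = publish  # assumed to wrap a real MQ producer's send call
        self._buffer: queue.Queue = queue.Queue(maxsize=max_buffered)
        self.dropped = 0  # records discarded because the buffer was full
        self.failed = 0   # records whose publish call raised
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, rank: int, metric: Dict[str, Any]) -> None:
        """Called from the training loop; never blocks on the network."""
        record = json.dumps({"rank": rank, **metric}).encode("utf-8")
        try:
            self._buffer.put_nowait(record)
        except queue.Full:
            self.dropped += 1

    def _drain(self) -> None:
        """Background loop: take records off the buffer and publish them."""
        while True:
            item = self._buffer.get()
            if item is self._STOP:
                break
            try:
                self._publish(item)
            except Exception:
                self.failed += 1

    def close(self) -> None:
        """Flush everything still buffered, then stop the worker."""
        self._buffer.put(self._STOP)
        self._worker.join()


# Usage: a list stands in for the broker so the sketch runs without any MQ.
received = []
sender = BufferedMetricSender(publish=received.append)
sender.emit(rank=0, metric={"name": "forward-compute", "elapsed_ms": 12.5})
sender.emit(rank=1, metric={"name": "forward-compute", "elapsed_ms": 13.1})
sender.close()
```

Because every record carries its `rank`, whatever consumes the messages on the central storage side can filter by rank before generating a tracing file. Dropping records when the buffer is full is one possible design choice; it keeps training from stalling on a slow broker at the cost of losing some metrics.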

vocaltract, Oct 09 '24 14:10

Thanks! But I still don't know how to write the MQHandler code. Do I need to create a separate script as a producer to receive messages from consumers? That is, does each rank send its own record information to the consumer, while the corresponding producer receives the rank-record information from different nodes?

zmtttt, Oct 10 '24 03:09