ColossalAI

[RFC] A tracer to monitor the memory usage during training

Open feifeibear opened this issue 3 years ago • 1 comments

Describe the feature

I propose implementing a runtime memory tracer that users can turn on or off. It traces the GPU memory footprint during the whole training process, or over just a few iterations, and presents the results to the user in a readable way. The tracer has to be as lightweight as possible so that it adds minimal performance overhead.
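A minimal sketch of such a switchable tracer follows. The `MemoryTracer` class and its parameters are hypothetical and are not existing ColossalAI API. The memory reader is injected so the example runs without a GPU; on a real machine one would probably pass `torch.cuda.memory_allocated`. Sampling only every `interval` steps is one possible way to keep the overhead low.

```python
from typing import Callable, List, Tuple


class MemoryTracer:
    """Hypothetical lightweight tracer; names are illustrative, not ColossalAI API."""

    def __init__(self, read_bytes: Callable[[], int], enabled: bool = True, interval: int = 1):
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.read_bytes = read_bytes  # e.g. torch.cuda.memory_allocated on a GPU machine
        self.enabled = enabled        # user-facing on/off switch
        self.interval = interval      # sample every `interval` steps
        self.samples: List[Tuple[int, int]] = []  # (step, bytes in use)
        self._step = 0

    def step(self) -> None:
        """Call once per training iteration."""
        current = self._step
        self._step += 1
        if not self.enabled or current % self.interval != 0:
            return  # disabled or skipped step: no memory query is issued
        self.samples.append((current, self.read_bytes()))

    def peak(self) -> int:
        """Largest sampled value, or 0 if nothing was recorded."""
        return max((used for _, used in self.samples), default=0)


# Usage with a fake reader standing in for a real GPU query.
fake_usage = iter([100, 300, 200, 400])
tracer = MemoryTracer(read_bytes=lambda: next(fake_usage), interval=2)
for _ in range(4):
    tracer.step()
```

With `interval=2`, only steps 0 and 2 query the reader, so a disabled or sparsely sampling tracer costs little more than an integer increment per iteration.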

Benefits:

  1. With tracing, we can design a better offloading strategy.
  2. Knowing the memory statistics, we can choose an appropriate batch size, or even the best model parallelism strategy.
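As a rough illustration of the second benefit, peak memory measured at two batch sizes can be extrapolated to estimate the largest batch that fits on the device. This sketch assumes peak memory grows roughly linearly with batch size, which is only an approximation. The function name, the `safety` margin, and the numbers are hypothetical.

```python
def estimate_max_batch_size(bs_a: int, peak_a: float, bs_b: int, peak_b: float,
                            capacity: float, safety: float = 0.9) -> int:
    """Linear extrapolation: peak(bs) ~ fixed + per_sample * bs (an assumption)."""
    if bs_a == bs_b:
        raise ValueError("need measurements at two different batch sizes")
    per_sample = (peak_b - peak_a) / (bs_b - bs_a)
    if per_sample <= 0:
        raise ValueError("peak memory should grow with batch size")
    fixed = peak_a - per_sample * bs_a   # weights, optimizer state, etc.
    budget = capacity * safety - fixed   # memory left for per-sample activations
    return max(int(budget // per_sample), 0)


# Hypothetical measurements in MiB: batch 8 peaks at 6144, batch 16 at 10240,
# on a device with 16384 MiB of memory.
max_bs = estimate_max_batch_size(8, 6144, 16, 10240, capacity=16384)
```

The `safety` factor leaves headroom for fragmentation and temporary buffers, which a linear model does not capture.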

Current status: I have implemented a MemTracerOpHook in the engine/ophooks directory. You can pass an instance of MemTracerOpHook to the Engine class.
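The hook idea can be sketched without depending on ColossalAI internals. The example below is a hypothetical stand-in, not the actual `MemTracerOpHook` or `Engine` interface: the method names `pre_forward` and `post_forward`, the toy `run_ops` driver, and the fake allocator are all assumptions made for illustration. In plain PyTorch, a similar effect could be obtained with `torch.nn.Module.register_forward_pre_hook` and `register_forward_hook`.

```python
from typing import Callable, Dict, List, Tuple


class MemTracerHookSketch:
    """Hypothetical op hook that records memory before and after each operator."""

    def __init__(self, read_bytes: Callable[[], int]):
        self.read_bytes = read_bytes
        self._before: Dict[str, int] = {}
        self.records: List[dict] = []

    def pre_forward(self, op_name: str) -> None:
        self._before[op_name] = self.read_bytes()

    def post_forward(self, op_name: str) -> None:
        after = self.read_bytes()
        before = self._before.pop(op_name)
        self.records.append(
            {"op": op_name, "before": before, "after": after, "delta": after - before}
        )


def run_ops(ops: List[Tuple[str, Callable[[], None]]], hooks: List[MemTracerHookSketch]) -> None:
    """Toy driver standing in for an engine: calls every hook around each op."""
    for name, fn in ops:
        for hook in hooks:
            hook.pre_forward(name)
        fn()
        for hook in hooks:
            hook.post_forward(name)


# Fake allocator so the sketch runs without a GPU.
allocated = {"bytes": 0}


def alloc(n: int) -> Callable[[], None]:
    def _op() -> None:
        allocated["bytes"] += n
    return _op


hook = MemTracerHookSketch(read_bytes=lambda: allocated["bytes"])
run_ops([("linear1", alloc(1024)), ("relu", alloc(0)), ("linear2", alloc(2048))], [hook])
```

Recording a per-operator delta, rather than only a global peak, is what would make the trace useful for deciding which tensors to offload.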

TODO: The user passes an argument or adds an item to the config file to turn the tracer on. During DNN training, records are appended to a log file in real time. It remains to decide how to present the results to users:

  1. online way: A separate process reads the file and shows real-time results to the user.

  2. offline way: After training has finished, the user can read the file and view the memory footprint statistics for the whole training stage. As the statistics are similar between iterations, show
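Both presentation modes could share one append-only log. The sketch below assumes a JSON Lines file with hypothetical `step` and `bytes` fields; the function names and the file name are illustrative only. The online reader polls from a byte offset and skips a partially written last line, while the offline reader scans the whole file once.

```python
import json
import os
import tempfile
from typing import List, Tuple


def append_record(path: str, step: int, used_bytes: int) -> None:
    """Append one JSON line so readers can consume the log while training runs."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"step": step, "bytes": used_bytes}) + "\n")


def read_new_records(path: str, offset: int = 0) -> Tuple[List[dict], int]:
    """Online mode: return records written since `offset`, plus the new offset."""
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line; retry on next poll
            offset += len(line)
            records.append(json.loads(line))
    return records, offset


def summarize(path: str) -> dict:
    """Offline mode: read the whole log and compute simple statistics."""
    records, _ = read_new_records(path, 0)
    values = [r["bytes"] for r in records]
    if not values:
        return {"count": 0, "peak": 0, "mean": 0.0}
    return {"count": len(values), "peak": max(values), "mean": sum(values) / len(values)}


# Usage: the writer and the online reader interleave, then an offline summary.
log_path = os.path.join(tempfile.mkdtemp(), "mem_trace.jsonl")
append_record(log_path, 0, 100)
append_record(log_path, 1, 300)
first, pos = read_new_records(log_path)        # picks up steps 0 and 1
append_record(log_path, 2, 200)
second, pos = read_new_records(log_path, pos)  # picks up only step 2
stats = summarize(log_path)
```

A viewer process could call `read_new_records` in a loop with a short sleep to drive a live display, and `summarize` could be extended to per-iteration statistics.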

Zhu believes we can refer to the following projects to design the UI:

https://github.com/ClementTsang/bottom
https://github.com/aksakalli/gtop

feifeibear avatar Feb 25 '22 09:02 feifeibear

Work in progress

binmakeswell avatar Apr 13 '22 04:04 binmakeswell

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 03:04 binmakeswell