[RFC] A tracer to monitor the memory usage during training
Describe the feature
I propose implementing a runtime memory tracer that users can turn on or off. It traces the GPU memory footprint during the whole training process, or for only a few iterations, and presents the results to the user in a readable way. The tracer should be as lightweight as possible so that it adds minimal performance overhead.
Benefits:
- With tracing, we can design a better offloading strategy.
- Knowing the memory statistics, we can set an appropriate batch size, or even choose the best model parallelism strategy.
Current status: I have implemented a MemTracerOpHook in the engine/ophooks directory. You can pass an instance of MemTracerOpHook to the Engine class.
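To make the idea concrete, here is a minimal sketch of how an operator-hook style tracer could sample memory before and after each operator. It is not the actual MemTracerOpHook code: the names `SimpleMemTracer`, `MemSample`, `pre_op`, `post_op`, and `read_bytes` are hypothetical, and the memory reader is injected so the sketch runs without a GPU. On a CUDA device one could presumably pass `torch.cuda.memory_allocated` as the reader.

```python
import time
from typing import Callable, List, NamedTuple


class MemSample(NamedTuple):
    op_name: str      # name of the operator or module being traced
    phase: str        # "pre" or "post", relative to the operator
    used_bytes: int   # memory reading at that point
    timestamp: float  # seconds, from time.perf_counter()


class SimpleMemTracer:
    """Hypothetical tracer that samples memory around each operator."""

    def __init__(self, read_bytes: Callable[[], int], enabled: bool = True):
        # read_bytes is an assumption of this sketch: any zero-argument
        # callable returning the current memory usage in bytes.
        self.read_bytes = read_bytes
        self.enabled = enabled
        self.samples: List[MemSample] = []

    def _record(self, op_name: str, phase: str) -> None:
        # When the tracer is off, return immediately to keep overhead minimal.
        if not self.enabled:
            return
        self.samples.append(
            MemSample(op_name, phase, self.read_bytes(), time.perf_counter())
        )

    def pre_op(self, op_name: str) -> None:
        self._record(op_name, "pre")

    def post_op(self, op_name: str) -> None:
        self._record(op_name, "post")

    def peak_bytes(self) -> int:
        # Peak over all recorded samples; 0 if nothing was recorded.
        return max((s.used_bytes for s in self.samples), default=0)


# Usage with fake readings standing in for real GPU memory queries.
fake_readings = iter([100, 350, 350, 220])
demo_tracer = SimpleMemTracer(read_bytes=lambda: next(fake_readings))
for name in ("linear1", "linear2"):
    demo_tracer.pre_op(name)
    demo_tracer.post_op(name)
print(demo_tracer.peak_bytes())  # 350
```

Injecting the reader keeps the tracer independent of any framework, and the early return in `_record` is one simple way to make the disabled path nearly free.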
TODO: The user passes an argument or adds an item to the config file to turn the tracer on. During DNN training, records are appended to a log file in real time. We still need to decide how to present the results to users:
- online way: A process will read the file and show real-time results to the user.
- offline way: After training is finished, the user can read the file and show the memory footprint statistics during the whole training stage. As the statistics are similar between iterations, show
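The log-file workflow described above could be sketched as follows. This is only an illustration under assumptions: the JSON-lines record format, the field names `iter`, `op`, and `bytes`, and the helper names `append_record` and `peak_per_iteration` are all hypothetical and not part of the existing code. Each record is flushed right after it is written, so an online reader polling the same file would see new lines promptly, while the offline reader summarizes the file after training.

```python
import json
from collections import defaultdict
from typing import IO, Dict


def append_record(log_file: IO[str], iteration: int, op_name: str, used_bytes: int) -> None:
    """Append one JSON record per line and flush so readers see it right away."""
    record = {"iter": iteration, "op": op_name, "bytes": used_bytes}
    log_file.write(json.dumps(record) + "\n")
    log_file.flush()


def peak_per_iteration(path: str) -> Dict[int, int]:
    """Offline pass: read the whole log and return the peak bytes per iteration."""
    peaks: Dict[int, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rec = json.loads(line)
            peaks[rec["iter"]] = max(peaks[rec["iter"]], rec["bytes"])
    return dict(peaks)
```

A line-oriented format is convenient here because the training process only ever appends, and a reader can consume complete lines without parsing the whole file as one document.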
Zhu believes we can refer to the following projects to design the UI:
https://github.com/ClementTsang/bottom
https://github.com/aksakalli/gtop
Work in progress
We have updated a lot. This issue was closed due to inactivity. Thanks.