[FEATURE]: New way to tensorboard

Open jiezhuzzz opened this issue 3 years ago • 4 comments

Describe the feature

Currently, we use torch.utils.tensorboard.SummaryWriter to export profiling data. Because tensorboard provides very limited APIs, we cannot draw interactive pie charts or iteration-level memory usage charts. Therefore, we should develop our own tensorboard plugin manager. The first step is to export our profiling data to a JSON file.

jiezhuzzz avatar Apr 11 '22 08:04 jiezhuzzz

The JSON file can be split into three parts: memory usage, communication overhead, and PCIe bandwidth, according to our current implementation of colossalai profiler.

memory usage JSON:

{
"schedule": 
"step": 
"information": {
    "worker": 
    "cuda_usage": []
    }
}
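
A minimal sketch of how one record with this layout might be built and serialized. The value types (a string for "schedule", integers for "step" and "worker", a list of byte counts for "cuda_usage") are assumptions for illustration, and `memory_usage_record` is a hypothetical helper, not part of the colossalai profiler:

```python
import json


def memory_usage_record(schedule, step, worker, cuda_usage):
    """Build one memory-usage record following the layout above.

    All value types are assumed; the issue does not specify them.
    """
    return {
        "schedule": schedule,
        "step": step,
        "information": {
            "worker": worker,
            "cuda_usage": list(cuda_usage),
        },
    }


# Hypothetical values, for illustration only.
record = memory_usage_record("hypothetical-schedule", 3, 0, [1024, 2048, 1536])
text = json.dumps(record, indent=4)
```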

communication overhead JSON:

{
"total_cuda_time": 
"total_comm_vol": 
"total_count": 
"events": [{
    "location": 
    "self_cuda_time": 
    "self_comm_vol": 
    "self_count": 
    }]
}
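
One way to fill in this layout, sketched under the assumption that each "total_*" field is simply the sum of the matching "self_*" field over all events. The issue does not confirm this, and `communication_summary` is a hypothetical helper name:

```python
def communication_summary(events):
    """Aggregate per-event communication stats into the layout above.

    Assumption: totals are plain sums of the per-event "self_*" values.
    """
    events = [dict(e) for e in events]  # copy so the caller's data is untouched
    return {
        "total_cuda_time": sum(e["self_cuda_time"] for e in events),
        "total_comm_vol": sum(e["self_comm_vol"] for e in events),
        "total_count": sum(e["self_count"] for e in events),
        "events": events,
    }


# Hypothetical events, for illustration only.
summary = communication_summary([
    {"location": "layer1.all_reduce", "self_cuda_time": 10, "self_comm_vol": 4096, "self_count": 2},
    {"location": "layer2.all_gather", "self_cuda_time": 5, "self_comm_vol": 1024, "self_count": 1},
])
```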

PCIe bandwidth JSON:

{
"h2d_time": 
"h2d_count": 
"d2h_time": 
"d2h_count": 
"events": [{
    "location": 
    "cuda_time": 
    "pcie_vol": 
    "count": 
    }]
}
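
A plugin reading this file could derive a per-event bandwidth from "pcie_vol" and "cuda_time". The sketch below assumes "pcie_vol" is in bytes, "cuda_time" is in seconds, and each "location" is unique; none of this is stated in the issue, and `pcie_bandwidth_per_event` is a hypothetical helper:

```python
def pcie_bandwidth_per_event(report):
    """Map each event location to pcie_vol / cuda_time.

    Assumed units: bytes and seconds, giving bytes per second.
    Events with a non-positive time are reported as 0.0.
    """
    result = {}
    for event in report["events"]:
        elapsed = event["cuda_time"]
        result[event["location"]] = event["pcie_vol"] / elapsed if elapsed > 0 else 0.0
    return result


# Hypothetical report, for illustration only.
report = {
    "h2d_time": 2.0, "h2d_count": 1,
    "d2h_time": 0.0, "d2h_count": 1,
    "events": [
        {"location": "loader.to_cuda", "cuda_time": 2.0, "pcie_vol": 8000, "count": 1},
        {"location": "metrics.to_cpu", "cuda_time": 0.0, "pcie_vol": 16, "count": 1},
    ],
}
bandwidth = pcie_bandwidth_per_event(report)
```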

jiezhuzzz avatar Apr 11 '22 08:04 jiezhuzzz

⚠️ Keep in mind: multiple processes may access the same file, so you should make sure the JSON file stays consistent.

ver217 avatar Apr 11 '22 08:04 ver217

⚠️ Keep in mind: multiple processes may access the same file, so you should make sure the JSON file stays consistent.

Each process should keep its own JSON file.
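
A rough sketch of the one-file-per-process idea. The file naming scheme and the helpers `profile_path` and `write_profile` are hypothetical. In a distributed run the rank would typically come from `torch.distributed.get_rank()`; here it is passed in directly so the sketch has no torch dependency. Writing to a temporary file and renaming it is an extra precaution, not something the thread asks for:

```python
import json
import os
import tempfile


def profile_path(log_dir, rank, name="profile"):
    """Return a per-rank file name so no two processes share a file."""
    return os.path.join(log_dir, f"{name}_rank{rank}.json")


def write_profile(log_dir, rank, data):
    """Write this process's profiling data to its own JSON file."""
    os.makedirs(log_dir, exist_ok=True)
    path = profile_path(log_dir, rank)
    # Write to a temp file in the same directory, then rename it into place,
    # so a reader never sees a half-written JSON file.
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=4)
    os.replace(tmp_path, path)
    return path
```

Each process would call `write_profile(log_dir, rank, data)` with its own rank, and a plugin could later collect every `profile_rank*.json` file in the directory.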

FrankLeeeee avatar Apr 11 '22 08:04 FrankLeeeee

OK, please check my latest commit #717. Different filenames refer to different processes.

jiezhuzzz avatar Apr 11 '22 09:04 jiezhuzzz

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 03:04 binmakeswell