[FEATURE]: New way to tensorboard
Describe the feature
Currently, we use torch.utils.tensorboard.SummaryWriter to export profiling data. Because the APIs provided by tensorboard are very limited, we cannot draw interactive pie charts or iteration-level memory usage charts. Therefore, we should develop our own tensorboard plugin manager, and the first step is to export our profiling data to a JSON file.
The JSON file can be split into three parts, following our current implementation of the colossalai profiler: memory usage, communication overhead, and PCIe bandwidth.
memory usage JSON:
{
  "schedule":
  "step":
  "information": {
    "worker":
    "cuda_usage": []
  }
}
communication overhead JSON:
{
  "total_cuda_time":
  "total_comm_vol":
  "total_count":
  "events": [{
    "location":
    "self_cuda_time":
    "self_comm_vol":
    "self_count":
  }]
}
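To make the layout above concrete, here is a minimal sketch of how the communication overhead report could be assembled. It assumes each `total_*` field is simply the sum of the matching `self_*` field over all events, which the issue does not state. The helper name `build_comm_report`, the sample locations, and the value types and units are hypothetical.

```python
import json


def build_comm_report(events):
    """Wrap per-location events in the communication overhead layout.

    Assumption: every total is the sum of the matching self_* field.
    """
    return {
        "total_cuda_time": sum(e["self_cuda_time"] for e in events),
        "total_comm_vol": sum(e["self_comm_vol"] for e in events),
        "total_count": sum(e["self_count"] for e in events),
        "events": events,
    }


# Hypothetical sample events; locations and units are made up for illustration.
events = [
    {"location": "all_reduce@layer1", "self_cuda_time": 120.0,
     "self_comm_vol": 4096, "self_count": 2},
    {"location": "broadcast@init", "self_cuda_time": 30.0,
     "self_comm_vol": 1024, "self_count": 1},
]

report = build_comm_report(events)
print(json.dumps(report, indent=2))
```

Keeping the totals derived from the events means the plugin side never sees a total that disagrees with the per-location rows.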
PCIe bandwidth JSON:
{
  "h2d_time":
  "h2d_count":
  "d2h_time":
  "d2h_count":
  "events": [{
    "location":
    "cuda_time":
    "pcie_vol":
    "count":
  }]
}
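As a rough illustration of how a plugin might turn these fields into a bandwidth figure, the sketch below divides `pcie_vol` by `cuda_time` for each event. The units are an assumption (bytes and seconds), since the issue does not specify them, and `pcie_bandwidth` and the sample events are hypothetical.

```python
def pcie_bandwidth(event):
    """Return transferred volume per unit time for one PCIe event.

    Assumption: pcie_vol is in bytes and cuda_time is in seconds.
    Returns 0.0 when no time was recorded, to avoid dividing by zero.
    """
    if event["cuda_time"] <= 0:
        return 0.0
    return event["pcie_vol"] / event["cuda_time"]


# Hypothetical sample events for illustration only.
pcie_events = [
    {"location": "dataloader", "cuda_time": 0.5, "pcie_vol": 1024, "count": 4},
    {"location": "checkpoint", "cuda_time": 0.0, "pcie_vol": 0, "count": 0},
]

bandwidths = {e["location"]: pcie_bandwidth(e) for e in pcie_events}
print(bandwidths)
```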
⚠️ Keep in mind: multiple processes may access the same file, so you should make sure the JSON file stays consistent.
Each process should keep its own JSON file.
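One possible way to do this is sketched below: derive the filename from the process rank and write through a temporary file so a reader never sees a half-written JSON. This is not necessarily how the profiler does it. The helpers `rank_file` and `write_json_atomic` and the `profile_rank{N}.json` naming are hypothetical, and reading the rank from the `RANK` environment variable assumes a launcher such as torchrun that sets it.

```python
import json
import os
import tempfile


def rank_file(log_dir, rank, prefix="profile"):
    """Build a per-process path, e.g. <log_dir>/profile_rank3.json."""
    return os.path.join(log_dir, f"{prefix}_rank{rank}.json")


def write_json_atomic(path, payload):
    """Write payload to a temp file in the target directory, then rename it.

    os.replace swaps the file in one step, so readers see either the old
    file or the complete new one.
    """
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if serialization or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Assumption: the launcher exports RANK; fall back to 0 for single-process runs.
rank = int(os.environ.get("RANK", "0"))
out_dir = tempfile.mkdtemp()  # stand-in for the real log directory
path = rank_file(out_dir, rank)
write_json_atomic(path, {"h2d_count": 0, "d2h_count": 0, "events": []})
print(path)
```

With one file per rank there is no cross-process contention on writes, and the plugin can merge the files when it loads them.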
OK, please check my latest commit #717. Different filenames refer to different processes.
We have updated a lot. This issue was closed due to inactivity. Thanks.