dlrover
[Feature]: Summarize the elapsed time of PyTorch ops in a training job.
Users usually locate the bottleneck of a training pipeline by inspecting the elapsed time of individual ops. If we summarize the elapsed time automatically once training starts, we can detect the bottleneck automatically and either mitigate it or give users suggestions. Today, users typically add manual timing like this:
import time

def train():
    for i, epoch in enumerate(range(start_epoch, end_epoch)):
        for train_sample in train_data_loader:
            start_time = time.time()
            # doing...
            print('Time consuming: {}s'.format(time.time() - start_time))
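To illustrate what an automatic summary could look like, here is a minimal sketch that accumulates wall-clock time per named stage and reports the slowest one. It is not an existing DLRover API: `StageTimer`, `record`, `summary`, `bottleneck`, and the `step_fn` argument are hypothetical names chosen for illustration, and it measures coarse stages rather than individual PyTorch ops.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

_END = object()  # sentinel marking an exhausted data loader


class StageTimer:
    """Hypothetical helper: accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of calls

    @contextmanager
    def record(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped code raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def summary(self):
        """Return (name, total_s, calls, mean_s) rows, slowest total first."""
        rows = [
            (name, total, self.counts[name], total / self.counts[name])
            for name, total in self.totals.items()
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)

    def bottleneck(self):
        """Return the stage with the largest total time, or None if empty."""
        rows = self.summary()
        return rows[0][0] if rows else None


def train(timer, data_loader, step_fn):
    # data_loader and step_fn stand in for the real loader and training step.
    batches = iter(data_loader)
    while True:
        # Time the fetch separately so a slow input pipeline is visible.
        with timer.record("data_loading"):
            batch = next(batches, _END)
        if batch is _END:
            break
        with timer.record("train_step"):
            step_fn(batch)
```

Calling `train(timer, loader, step_fn)` and then `timer.summary()` gives one row per stage, and `timer.bottleneck()` names the stage to look at first. Note that the final, empty fetch is also counted under `data_loading`. On GPU, kernels run asynchronously, so host-side timers like this one only reflect device time if the step synchronizes (for example with `torch.cuda.synchronize()`); for true per-op timing, `torch.profiler` is the more appropriate tool.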
This issue has been automatically marked as stale because it has not had recent activity.