dlrover
Torch Trainer Hook
The objective of this issue is to add a hook or callback system to our PyTorch trainer so that it can invoke resource monitoring and time reporting at the start of training. The hook should integrate cleanly into the training process and must not interfere with the main training tasks.
We need to design this hook so that it can trigger our resources reporter and, potentially, other types of monitors we may add in the future.
We can implement this manually or use a callback mechanism similar to the one available in PyTorch Lightning.
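As a starting point for discussion, here is a minimal sketch of what the manual option could look like. It is framework-agnostic and not based on existing dlrover code: `TrainerHook`, `ResourceReportHook`, `Trainer`, and the `start_reporter` callable are all hypothetical names, and the real resources reporter would be plugged in where the lambda is. The event names `on_train_start` and `on_train_end` follow the naming style of PyTorch Lightning callbacks, but the classes here do not depend on Lightning.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


class TrainerHook:
    """Base class for hooks (hypothetical). Every event is a no-op by default."""

    def on_train_start(self, trainer: "Trainer") -> None:
        pass

    def on_train_end(self, trainer: "Trainer") -> None:
        pass


class ResourceReportHook(TrainerHook):
    """Starts a reporter and records timing (hypothetical).

    `start_reporter` stands in for whatever launches the real
    resources reporter; it is an assumption, not an existing API.
    """

    def __init__(self, start_reporter: Callable[[], None]) -> None:
        self._start_reporter = start_reporter
        self.start_time: Optional[float] = None
        self.elapsed: Optional[float] = None

    def on_train_start(self, trainer: "Trainer") -> None:
        self.start_time = time.monotonic()
        self._start_reporter()

    def on_train_end(self, trainer: "Trainer") -> None:
        if self.start_time is not None:
            self.elapsed = time.monotonic() - self.start_time


class Trainer:
    """Toy trainer that only shows where hooks are invoked."""

    def __init__(self, step_fn: Callable[[object], None],
                 hooks: Iterable[TrainerHook] = ()) -> None:
        self.step_fn = step_fn
        self.hooks: List[TrainerHook] = list(hooks)
        self.hook_errors: List[Tuple[str, Exception]] = []

    def _call_hooks(self, event: str) -> None:
        # A failing hook is recorded, not raised, so monitoring
        # problems never interrupt the main training loop.
        for hook in self.hooks:
            try:
                getattr(hook, event)(self)
            except Exception as exc:
                self.hook_errors.append((event, exc))

    def fit(self, batches: Iterable[object]) -> int:
        self._call_hooks("on_train_start")
        steps = 0
        try:
            for batch in batches:
                self.step_fn(batch)
                steps += 1
        finally:
            self._call_hooks("on_train_end")
        return steps


# Usage: the lambda is a placeholder for starting the real reporter.
calls: List[str] = []
hook = ResourceReportHook(start_reporter=lambda: calls.append("reporter started"))
trainer = Trainer(step_fn=lambda batch: None, hooks=[hook])
steps = trainer.fit(range(3))
```

Adding another monitor later would mean subclassing `TrainerHook` and passing the new instance in `hooks`, without touching the training loop. Whether hook errors should be swallowed, logged, or re-raised is a design choice left open here.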