aim
aim copied to clipboard
[feat] Tensorboard real-time sync
Fixes #2015
@mihran113 @alberttorosyan @SGevorg Review please? Is this feature makes sense? Thanks
@Sharathmk99 thanks for your contribution 🙌
The idea we've been exploring for a while is to refactor run.py
file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can have
TBRun
plugin derived from Run
and extending the functionality to keep track of the tb event files.
SystemResourcesRun
plugin derived from Run
and tracking CPU/GPU/memory usage in a background.
Do you think it would make sense to adjust implementation accordingly?
@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor
run.py
file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can haveTBRun
plugin derived fromRun
and extending the functionality to keep track of the tb event files.SystemResourcesRun
plugin derived fromRun
and tracking CPU/GPU/memory usage in a background.Do you think it would make sense to adjust implementation accordingly?
Yes of-course it makes more sense. I understand how much code present in run.py
.
I can refractor the code to align with the new changes. Do you already have draft implementation in some branch?
Many thanks!
@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor
run.py
file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can haveTBRun
plugin derived fromRun
and extending the functionality to keep track of the tb event files.SystemResourcesRun
plugin derived fromRun
and tracking CPU/GPU/memory usage in a background. Do you think it would make sense to adjust implementation accordingly?Yes of-course it makes more sense. I understand how much code present in
run.py
. I can refractor the code to align with the new changes. Do you already have draft implementation in some branch? Many thanks!
@Sharathmk99 unfortunately we don't have any branch with reference implementation yet. We can discuss the design principles here and implement the new functionality as a plugin. Afterwards, the rest of the code can be easily refactored.
@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor
run.py
file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can haveTBRun
plugin derived fromRun
and extending the functionality to keep track of the tb event files.SystemResourcesRun
plugin derived fromRun
and tracking CPU/GPU/memory usage in a background. Do you think it would make sense to adjust implementation accordingly?Yes of-course it makes more sense. I understand how much code present in
run.py
. I can refractor the code to align with the new changes. Do you already have draft implementation in some branch? Many thanks!@Sharathmk99 unfortunately we don't have any branch with reference implementation yet. We can discuss the design principles here and implement the new functionality as a plugin. Afterwards, the rest of the code can be easily refactored.
@alberttorosyan Yes sure. Can you share the ideal bit more, so I can start implementing? Thanks
@Sharathmk99 sorry for late response.
I think we can select the following approach:
1st of all, separate the new functionality for keeping track of the tensorboard logs into a separate mixin class. This can be implemented as a separate module inside ext/
.
Next step would be, implementing a new: TBConverterRun
class, which will derive from both base Run
and the newly implemented mixin, thus enabling the track() method for the mixin itself, and getting the behavior of a base Run
.
As you may probably noticed, there's a RunAutoClean
helper class, responsible for the Run's resources cleanup. The implementation might require some tweaks to make sure the clean-up is done properly and in correct order, regardless the number of mixin classes in the custom *Run
classes.
Hey @Sharathmk99! I just pushed some changes according the discussion we had above. Now there's a new Run
class in aim.ext.tensorboard_tracker
with cleaner interface (log dir is mandatory, no additional arguments). Please take a look.
@mihran113 @gorarakelyan since I've also contributed to the PR, could you please review the changes?
@gorarakelyan @mihran113, please take a look
@Sharathmk99 I've pushed some changes to resolve conflicts with main
branch. There were some big changes related to run resources management.
The only remaining step is to complete CLA signing step.
@Sharathmk99 I've pushed some changes to resolve conflicts with
main
branch. There were some big changes related to run resources management. The only remaining step is to complete CLA signing step.
Done. Thank you so much