aim icon indicating copy to clipboard operation
aim copied to clipboard

[feat] Tensorboard real-time sync

Open Sharathmk99 opened this issue 2 years ago • 6 comments

Fixes #2015

Sharathmk99 avatar Aug 14 '22 12:08 Sharathmk99

@mihran113 @alberttorosyan @SGevorg Review please? Is this feature makes sense? Thanks

Sharathmk99 avatar Aug 15 '22 21:08 Sharathmk99

@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor run.py file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can have TBRun plugin derived from Run and extending the functionality to keep track of the tb event files. SystemResourcesRun plugin derived from Run and tracking CPU/GPU/memory usage in a background.

Do you think it would make sense to adjust implementation accordingly?

alberttorosyan avatar Aug 16 '22 08:08 alberttorosyan

@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor run.py file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can have TBRun plugin derived from Run and extending the functionality to keep track of the tb event files. SystemResourcesRun plugin derived from Run and tracking CPU/GPU/memory usage in a background.

Do you think it would make sense to adjust implementation accordingly?

Yes of-course it makes more sense. I understand how much code present in run.py. I can refractor the code to align with the new changes. Do you already have draft implementation in some branch? Many thanks!

Sharathmk99 avatar Aug 16 '22 08:08 Sharathmk99

@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor run.py file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can have TBRun plugin derived from Run and extending the functionality to keep track of the tb event files. SystemResourcesRun plugin derived from Run and tracking CPU/GPU/memory usage in a background. Do you think it would make sense to adjust implementation accordingly?

Yes of-course it makes more sense. I understand how much code present in run.py. I can refractor the code to align with the new changes. Do you already have draft implementation in some branch? Many thanks!

@Sharathmk99 unfortunately we don't have any branch with reference implementation yet. We can discuss the design principles here and implement the new functionality as a plugin. Afterwards, the rest of the code can be easily refactored.

alberttorosyan avatar Aug 16 '22 09:08 alberttorosyan

@Sharathmk99 thanks for your contribution 🙌 The idea we've been exploring for a while is to refactor run.py file and keep the Run class as a basic interface for tracking/logging. In this context everything else currently included in Run becomes plugin. For example, we can have TBRun plugin derived from Run and extending the functionality to keep track of the tb event files. SystemResourcesRun plugin derived from Run and tracking CPU/GPU/memory usage in a background. Do you think it would make sense to adjust implementation accordingly?

Yes of-course it makes more sense. I understand how much code present in run.py. I can refractor the code to align with the new changes. Do you already have draft implementation in some branch? Many thanks!

@Sharathmk99 unfortunately we don't have any branch with reference implementation yet. We can discuss the design principles here and implement the new functionality as a plugin. Afterwards, the rest of the code can be easily refactored.

@alberttorosyan Yes sure. Can you share the ideal bit more, so I can start implementing? Thanks

Sharathmk99 avatar Aug 16 '22 12:08 Sharathmk99

@Sharathmk99 sorry for late response. I think we can select the following approach: 1st of all, separate the new functionality for keeping track of the tensorboard logs into a separate mixin class. This can be implemented as a separate module inside ext/. Next step would be, implementing a new: TBConverterRun class, which will derive from both base Run and the newly implemented mixin, thus enabling the track() method for the mixin itself, and getting the behavior of a base Run. As you may probably noticed, there's a RunAutoClean helper class, responsible for the Run's resources cleanup. The implementation might require some tweaks to make sure the clean-up is done properly and in correct order, regardless the number of mixin classes in the custom *Run classes.

alberttorosyan avatar Aug 18 '22 07:08 alberttorosyan

Hey @Sharathmk99! I just pushed some changes according the discussion we had above. Now there's a new Run class in aim.ext.tensorboard_tracker with cleaner interface (log dir is mandatory, no additional arguments). Please take a look.

alberttorosyan avatar Sep 23 '22 13:09 alberttorosyan

@mihran113 @gorarakelyan since I've also contributed to the PR, could you please review the changes?

alberttorosyan avatar Sep 27 '22 06:09 alberttorosyan

@gorarakelyan @mihran113, please take a look

alberttorosyan avatar Nov 07 '22 08:11 alberttorosyan

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Dec 12 '22 07:12 CLAassistant

@Sharathmk99 I've pushed some changes to resolve conflicts with main branch. There were some big changes related to run resources management. The only remaining step is to complete CLA signing step.

alberttorosyan avatar Dec 12 '22 09:12 alberttorosyan

@Sharathmk99 I've pushed some changes to resolve conflicts with main branch. There were some big changes related to run resources management. The only remaining step is to complete CLA signing step.

Done. Thank you so much

Sharathmk99 avatar Dec 12 '22 13:12 Sharathmk99