[Bug] Tensorboard is only available during the training run and not after it has finished

Open arkinson9 opened this issue 2 years ago • 16 comments

Tensorboard, when started from the GUI, seems to work only while a training run is in progress:

  1. If I press the "Tensorboard" button before I press the "Start Training" button, the Tensorboard Server seems not to start.
  2. Pressing "Tensorboard" after "Start Training" works, but:
  3. Immediately after a training run finishes, the connection to Tensorboard is lost, even if you press the "Tensorboard" button again.

It would be great if you could use Tensorboard easily via the GUI for as long as the GUI window is open.

arkinson9 avatar Jan 27 '24 12:01 arkinson9

Can confirm I can reproduce this with 100% rate. I thought this was intended behaviour lol

O-J1 avatar Jan 27 '24 13:01 O-J1

Thank you for confirming.

arkinson9 avatar Jan 27 '24 15:01 arkinson9

Can confirm I can reproduce this with 100% rate. I thought this was intended behaviour lol

It is actually completely intended behaviour. I have, however, created an initial PR to address this by tying the server stop to the launch of a new training run instead. There are some issues with it, though: it was tied completely to the UI. I need to rework my approach, but at the moment I've got a few other projects I'm working on.

SirTrippsalot avatar Jan 27 '24 19:01 SirTrippsalot

Yeah, this is quite annoying. If only we could view these graphs inside the actual program window, instead of having to use a whole web browser just to view some graphs...

Zueuk avatar Mar 20 '24 18:03 Zueuk

@Zueuk that's not what this bug is about at all. What you are referring to is an entirely different thing.

ppbrown avatar May 20 '24 18:05 ppbrown

Hey guys, this is a quick, hacky temporary solution I'm using right now. You can disable the shutdown of Tensorboard by commenting out these lines around line 778 in GenericTrainer.py:

    #self.tensorboard.close()
    #if self.config.tensorboard:
        #super()._stop_tensorboard()

This is a very hacky solution, though. If you want to run things again, you should kill OneTrainer and the Python process that runs Tensorboard before each subsequent run, or you'll get a port conflict.
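To see whether a leftover Tensorboard is still holding the port before starting another run, a quick check along these lines may help. This is only a sketch: `port_in_use` is a hypothetical helper, and it assumes Tensorboard is listening on its default port 6006 on the local machine (substitute whatever port your setup uses).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # which means a server is already bound to that port.
        return s.connect_ex((host, port)) == 0
```

If `port_in_use(6006)` returns True, the old Tensorboard process is probably still alive and should be killed before the next run.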

Anyway, this might be helpful at least until SirTrippsalot decides how he wants to approach the solution. Hope it helps, and thanks to SirTrippsalot and others for all your hard work on this.

seyedaed avatar Feb 14 '25 16:02 seyedaed

IMO the simplest clean solution is just to provide a wrapper that uses OneTrainer's venv and path settings and starts up a standalone Tensorboard.

ppbrown avatar Feb 14 '25 17:02 ppbrown

I'm a Linux guy, so here's the Linux version.

#!/usr/bin/env bash

. venv/bin/activate

tdir=$(python <<EOF
import json
with open("training_presets/#.json") as f:
        jdata = json.load(f)
        print(jdata["workspace_dir"]+"/tensorboard")
EOF
)

# host 0.0.0.0 to allow connection from other machines
# (quote $tdir so a workspace path containing spaces still works)
tensorboard --logdir "$tdir" --host 0.0.0.0

ppbrown avatar Feb 14 '25 17:02 ppbrown

@ppbrown thanks man, I love it! The only caveat might be if you're really short on RAM, as it looks like it adds another 276MB, which might not be a big deal. But it works, and without all the hacky stuff. I couldn't get a Windows batch version working, so I just ran it all in Python. This works on Windows if you place it in a Python file in your OneTrainer root folder and run it. Thanks again man.

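    import json
    import subprocess

    # Read the workspace directory from the current preset ("#.json").
    with open(r"training_presets\#.json") as f:
        jdata = json.load(f)

    # Raw strings stop "\t" and "\v" from being read as escape sequences.
    logdir = jdata["workspace_dir"] + r"\tensorboard"
    subprocess.run([r".\venv\Scripts\tensorboard.exe", "--logdir", logdir, "--host", "0.0.0.0"])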

seyedaed avatar Feb 15 '25 06:02 seyedaed

Interesting. That would only work if you have added the "tensorboard" module globally to your Python install, though.

It is not there by default. For most people, you have to activate the venv first.

ppbrown avatar Feb 15 '25 15:02 ppbrown

Wait... it DOES somehow work for me like that. I don't understand why :-/

But I had to use

os.system()

to call ./venv/bin/tensorboard

It wouldn't work for me with subprocess.run()
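One likely explanation for both observations, offered as a guess rather than a diagnosis: the `tensorboard` launcher that pip installs inside the venv points at the venv's own interpreter, so calling it by path works without activating anything. And on Linux, `subprocess.run()` given a single command string (without `shell=True`) treats the whole string as the program name, whereas `os.system()` hands it to a shell. Passing the arguments as a list avoids that difference. The sketch below assumes the venv layout used in this thread; `build_tensorboard_command` and `run_tensorboard` are hypothetical helper names.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def build_tensorboard_command(venv_dir: str, logdir: str, host: str = "0.0.0.0") -> List[str]:
    """Build an argument list for the venv's own tensorboard launcher."""
    venv = Path(venv_dir)
    if os.name == "nt":
        exe = venv / "Scripts" / "tensorboard.exe"  # Windows venv layout
    else:
        exe = venv / "bin" / "tensorboard"          # Linux/macOS venv layout
    # A list is passed to the program as-is, so no shell is needed
    # and paths containing spaces do not need extra quoting.
    return [str(exe), "--logdir", logdir, "--host", host]


def run_tensorboard(venv_dir: str, logdir: str) -> int:
    """Run tensorboard in the foreground and return its exit code."""
    return subprocess.run(build_tensorboard_command(venv_dir, logdir)).returncode
```

For example, `run_tensorboard("venv", "workspace/run/tensorboard")` from the OneTrainer root should behave the same on both platforms, assuming that log directory exists.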

ppbrown avatar Feb 15 '25 15:02 ppbrown

Yep, you can run it from Python in the venv. I named mine tensorrun.py, put it in the OneTrainer root, and just run:

.\venv\scripts\python.exe tensorrun.py

seyedaed avatar Feb 15 '25 15:02 seyedaed

Well, that's no fun. It needs a GUI-compatible solution. With the Linux "#!/bin/env python" magic, it works. Not sure what the Windows equivalent is. I would think just naming it ".py" should be adequate.
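For a double-click launch from a file manager, the working directory is not guaranteed to be the OneTrainer root, so relative paths such as "training_presets/#.json" may not resolve. A minimal sketch of one way around that is to resolve everything against an explicit root folder. `tensorboard_logdir` is a hypothetical helper, and it assumes that a relative "workspace_dir" in the preset is meant relative to the OneTrainer root.

```python
import json
from pathlib import Path


def tensorboard_logdir(root: Path, preset: str = "training_presets/#.json") -> Path:
    """Return <workspace_dir>/tensorboard for the preset stored under root."""
    with open(root / preset, encoding="utf-8") as f:
        workspace = json.load(f)["workspace_dir"]
    # An absolute workspace_dir replaces root when joined;
    # a relative one is resolved against root (an assumption).
    return root / workspace / "tensorboard"
```

Inside a launcher script placed in the OneTrainer root, the root can be taken from the script's own location with `Path(__file__).resolve().parent`, so it no longer matters where the script is started from.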

ppbrown avatar Feb 15 '25 16:02 ppbrown

Yeah, a GUI solution would be best. I think the challenge is more of a design issue than anything with the code. SirTrippsalot mentioned it works that way right now by design. I think your solution is a good temporary workaround until they get the time to figure out how they want to approach it. Maybe you could change the code so it runs from the Tensorboard button, but to be honest, I haven't really looked into all the code. I just looked for a quick hack, as it was annoying when I'd wake up, the run would be done, and I couldn't see the performance. I didn't even consider just running Tensorboard outside the app, for some reason.

For Windows, you just need to make a Python file in the OneTrainer root folder with the following:

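    import json
    import subprocess

    # Read the workspace directory from the current preset ("#.json").
    with open(r"training_presets\#.json") as f:
        jdata = json.load(f)

    # Raw strings stop "\t" and "\v" from being read as escape sequences.
    logdir = jdata["workspace_dir"] + r"\tensorboard"
    subprocess.run([r".\venv\Scripts\tensorboard.exe", "--logdir", logdir, "--host", "0.0.0.0"])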

Then run:

.\venv\scripts\python.exe tensorrun.py

from the command line in the OneTrainer folder. Thanks for the tip on this one, ppbrown. I lurk on the Discord if you ever want to hmu.

seyedaed avatar Feb 15 '25 16:02 seyedaed

Any UI solution would have to reliably stop Tensorboard on exit, but also still automatically kill Tensorboard directly after a CLI or cloud training, because otherwise the process never ends. Plus, it has to handle changes of the workspace dir or Tensorboard port while Tensorboard is still running. Otherwise, Tensorboard keeps reading from the wrong directory once you've started the next training.
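Those requirements could be sketched roughly as follows. This is not OneTrainer's code: `TensorBoardManager` and `tensorboard_command` are hypothetical names, and the sketch assumes that `tensorboard` is on the PATH (for example, with the venv activated).

```python
import atexit
import subprocess
from typing import Callable, List, Optional, Tuple


def tensorboard_command(logdir: str, port: int) -> List[str]:
    # Assumes "tensorboard" is on PATH, e.g. because the venv is activated.
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


class TensorBoardManager:
    """Hypothetical helper that keeps at most one server process alive."""

    def __init__(self, make_command: Callable[[str, int], List[str]] = tensorboard_command):
        self._make_command = make_command
        self._proc: Optional[subprocess.Popen] = None
        self._settings: Optional[Tuple[str, int]] = None
        atexit.register(self.stop)  # stop the server when the program exits

    def ensure_running(self, logdir: str, port: int) -> None:
        wanted = (logdir, port)
        alive = self._proc is not None and self._proc.poll() is None
        if alive and self._settings == wanted:
            return  # already serving the right directory on the right port
        self.stop()  # settings changed or the process died, so restart
        self._proc = subprocess.Popen(self._make_command(logdir, port))
        self._settings = wanted

    def stop(self) -> None:
        if self._proc is not None and self._proc.poll() is None:
            self._proc.terminate()
            try:
                self._proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                self._proc.kill()  # fall back to a hard kill
                self._proc.wait()
        self._proc = None
        self._settings = None
```

In this sketch, a GUI would call `ensure_running()` from the Tensorboard button and again at the start of each training run, while a CLI or cloud run would call `stop()` explicitly as soon as training ends.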

dxqb avatar Feb 15 '25 16:02 dxqb

My language was imprecise. When I said "GUI", I meant "not command line", i.e. "one click in FileManager".

I didn't mean "integrated into the OneTrainer main program."

ppbrown avatar Feb 15 '25 17:02 ppbrown