Colab training: Plotting training process
Hi,
first of all you did a great job!
I'm using Colab to train my custom model. You wrote that it is unfortunately not possible to use TensorBoard to check the training process, but that you might create your own implementation for graph plotting later. I wanted to ask whether there is already a way to plot the training?
Thanks in advance and best regards Chris
With Colab, it is absolutely possible to use TensorBoard to check the training process. I don't know how close my version is to the original code, because I have changed it, but with a little bit of additional code I got TensorBoard tracking to work just fine.
The main thing is to run the TensorBoard cell (the `%tensorboard --logdir log` cell) before you start training. Then you can hit the refresh button on the TensorBoard panel itself so that it pulls in the new information from the log files as training proceeds and more data is written to them.
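For reference, the cell I run looks roughly like this (the `log` path is assumed to match the repo's default `TRAIN_LOGDIR`; adjust it if yours differs):

```
# Run this in its own Colab cell before the training cell.
# Point --logdir at wherever your TRAIN_LOGDIR writes (assumed to be "log" here).
%load_ext tensorboard
%tensorboard --logdir log
```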
If you are interested in what I changed, comment back and I can make a better reply explaining exactly what I did.
Hi mayhemsloth, that's great. I would be very happy about a detailed description :)
Thanks a lot in advance and best regards
Here are the changes I made, all to train.py. The summary is that you name a model when you start training, and that name is used to create log subfolders for that model. (Note that I use validation/test interchangeably.) Additionally, a validation log writer is created alongside the training log writer, so that TensorBoard can overlay the epoch loss of the training set and the validation set on the same graph, which lets you easily see whether you are overfitting. This StackOverflow answer is where the idea came from. A consolidated sketch of all the changes follows the numbered steps below.
1) Add an argument to `main`: `custom_model_name = None`. Basically this allows you to pass a name into the training function so that you can name a model. You can use the name for various things later, but in this context it is used to name the logs.

2) Create subfolders for the different models. This is done simply by joining your configs' `TRAIN_LOGDIR` path with the `custom_model_name` you passed in; if you don't pass a name, the model is called "unnamed_model":
`logdir = os.path.join(TRAIN_LOGDIR, custom_model_name) if custom_model_name is not None else os.path.join(TRAIN_LOGDIR, 'unnamed_model')`

2a) Change the "clear the past logs" line to use the new directory path stored in `logdir` instead of the global variable provided by configs: `if os.path.exists(logdir): shutil.rmtree(logdir)`

3) Make two different `tf.summary` writers, the normal (training) writer and a validation writer. Note the use of `os.path.join`, which makes a subfolder inside the model's log subfolder for each writer:
`writer = tf.summary.create_file_writer(os.path.join(logdir, 'train'))`
`validate_writer = tf.summary.create_file_writer(os.path.join(logdir, 'test'))`

4) The goal now is to store the epoch loss in the same named scalar from the two different writers. In the code block under the line `# writing validate summary data`, change the writer to `validate_writer`.

4a) Inside the `with validate_writer.as_default():` block, add `tf.summary.scalar('epoch_loss/total_loss', total_val/count, step=epoch)`.

5) Right inside the epoch loop, before the trainset loop, add a variable `total_train_loss = 0.0`. This is used to accumulate the total losses of the training steps within each epoch.

5a) Inside the trainset loop, add `total_train_loss += results[5]` to add each step's total loss to the running sum.

5b) After the entirety of the `with validate_writer.as_default():` block, add a `with writer.as_default():` block containing `tf.summary.scalar('epoch_loss/total_loss', total_train_loss/steps_per_epoch, step=epoch)`, followed by `writer.flush()`.
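Putting it all together, here is roughly how those pieces sit inside main() in train.py. Treat it as a sketch rather than a drop-in replacement: the surrounding loop structure and names (trainset, testset, train_step, validate_step, results, steps_per_epoch, TRAIN_EPOCHS) are what I remember from the repo's train.py and may differ in your copy, and the indices used to pull the total loss out of results are assumptions you should check.

```python
import os
import shutil
import tensorflow as tf
# TRAIN_LOGDIR, TRAIN_EPOCHS, trainset, testset, train_step, validate_step and
# steps_per_epoch all come from the repo's train.py / configs and are assumed here.

def main(custom_model_name=None):                                   # 1) new argument
    # 2) per-model log subfolder
    logdir = os.path.join(TRAIN_LOGDIR, custom_model_name) if custom_model_name is not None \
        else os.path.join(TRAIN_LOGDIR, 'unnamed_model')
    if os.path.exists(logdir): shutil.rmtree(logdir)                 # 2a) clear that model's past logs

    # 3) one writer per subfolder, so TensorBoard can overlay the two curves
    writer = tf.summary.create_file_writer(os.path.join(logdir, 'train'))
    validate_writer = tf.summary.create_file_writer(os.path.join(logdir, 'test'))

    # ... model / dataset / optimizer setup unchanged ...

    for epoch in range(TRAIN_EPOCHS):
        total_train_loss = 0.0                                       # 5) per-epoch accumulator
        for image_data, target in trainset:
            results = train_step(image_data, target)
            total_train_loss += results[5]                           # 5a) results[5] assumed to be the total loss

        # writing validate summary data
        count, total_val = 0, 0
        for image_data, target in testset:
            results = validate_step(image_data, target)
            count += 1
            total_val += results[3]                                  # assumed to be the total validation loss
        with validate_writer.as_default():                           # 4) same scalar name, validation writer
            tf.summary.scalar('epoch_loss/total_loss', total_val / count, step=epoch)  # 4a)
        validate_writer.flush()

        with writer.as_default():                                    # 5b) same scalar name, training writer
            tf.summary.scalar('epoch_loss/total_loss', total_train_loss / steps_per_epoch, step=epoch)
        writer.flush()
```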
Now, when you run TensorBoard before starting training, the main function will update the logs, and you can see the total (average) epoch loss of your training set graphed against the total (average) epoch loss of your validation set. Definitely useful.
Note that with the custom_model_name variable you can also save the models under their own custom names during training, or at the end, but I'll let you work out how to do that. :)
Thank you very much for your very detailed description!!! :) I will try to make these changes in my script too :)
@mayhemsloth Thanks again for your great description above. I would also like to plot an accuracy curve with TensorBoard in addition to the loss. Do you have any good advice on how I can do this?
best regards chris
@chrisTopp84
In object detection where you are predicting bounding boxes (like we are here), defining the "accuracy" is a little hard. Just like the total loss is actually the sum of different parts that take into account the different ways that an object detection algorithm can mess up, the "accuracy" for object detection is also more complicated than other algorithms (like simple image classification).
Luckily, in one of the recent updates, the author of this repo so graciously added the standardized way to measure accuracy in object detection: mAP, or mean average precision (explained here). This was a fantastic addition, because before I was kind of just eye-balling whether my custom trained model was "good enough" for its purpose. Now, though, you can break down by class what your model is doing well and not so well.
This functionality lives in the get_mAP function in the evaluate_mAP.py file and, in the original code, is called at the very end of training, in the try ... except block at the end of train.py.
To implement what you are asking about, you can simply run the get_mAP function at desired intervals during training and write its output to a new scalar in the validation log, which TensorBoard can then read and graph while training continues. Note that the original get_mAP returns only a single value (the mAP in %), but you can easily modify the function to also return ap_dictionary, an object that already exists inside get_mAP and is used to calculate the mAP. ap_dictionary is a dictionary whose keys are the class names and whose values are the AP for each class. Once you return this object, you can also write the per-class breakdown to the validation log, so you can see which classes the model is having the most trouble with; this can greatly help with error analysis. Because opening and writing log files (especially to Google Drive) is time-intensive, try to keep the total number of with ... writer.flush() blocks to a minimum: gather all the values you want in memory first and then write them to the log in one go.
Depending on a variety of things (validation set size, GPU speed, number of classes you care about, number of epochs trained, etc.), you probably don't want to run this at the end of every epoch. You can put it under an if statement so that it runs, say, every 5 or 10 epochs, or only when there is a new lowest-validation-loss model.
Please note that the get_mAP function doesn't work well with data augmentations, so if you are using them I would highly recommend adding a testset.data_aug = False line before the get_mAP call and restoring data_aug to True afterwards (so that you can continue using data augmentation during training).
Let me know if you have any more questions. I didn't explicitly tell you what to change in the code because 1) I don't have my code set up like this to calculate the mAP during training (only once at the end) and also, primarily 2) it's good to figure stuff out on your own :)
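That said, if it helps to see the overall shape, here is a rough, untested sketch of where such a block could sit inside the epoch loop. The get_mAP arguments and the (mAP, ap_dictionary) return value are assumptions you would need to check against your (modified) copy of evaluate_mAP.py; yolo, testset, and validate_writer are the names from my earlier reply.

```python
# Rough sketch only: assumes get_mAP has been modified to also return ap_dictionary,
# and that its arguments match your copy of evaluate_mAP.py -- verify both.
if (epoch + 1) % 5 == 0:                      # pick your own interval, e.g. every 5 or 10 epochs
    testset.data_aug = False                  # mAP evaluation doesn't play well with augmentations
    mAP, ap_dictionary = get_mAP(yolo, testset, score_threshold=0.25, iou_threshold=0.50)
    testset.data_aug = True                   # restore augmentation for the following training epochs

    # gather everything first, then write it all in one block to keep
    # log writes (and Google Drive I/O) to a minimum
    with validate_writer.as_default():
        tf.summary.scalar('mAP/overall', mAP, step=epoch)
        for class_name, ap in ap_dictionary.items():
            tf.summary.scalar('mAP/' + class_name, ap, step=epoch)
    validate_writer.flush()
```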