Tensorboard not displaying all the HParams events
This is my logs directory structure:
There are two Jupyter notebooks running in parallel with exactly the same code except for the prefix of the run-* directory. Both of them dump hparam_tuning metrics into the same directory. Both train the same model with the same hyper-parameters and metrics, but on different data. My requirement is to view all these runs in the same table in TensorBoard.
EDIT NOTE: Training data in the code gets generated & processed ONLY once in one notebook for all the runs. I cannot read & process the data multiple times for different runs. Also, these notebooks run on different GPUs; I want to run them in parallel, which is why I cannot combine them into one notebook, since the runs there would be sequential.
TensorBoard reads only 9 of the 18 hparam runs that I have in my logs directory:
I am, however, able to see the scalars I use to monitor the loss, which are in the same log directory. Moreover, the metrics for hyper-parameter tuning are visible for all the runs under the "Scalars" tab, but not under the HPARAMS tab.
EDIT NOTE: I have kept the identifier as another hyper-parameter in order to view the hparam logs generated by both Jupyter notebooks. It's just for filtering purposes, since there is no feature to filter them via trialId.
Is there any way I can merge the results of different session runs in one table?
Can you try passing the new --reload_multifile=true flag to TensorBoard and see if that addresses the issue? You'll need version 2.0.0 or greater.
Without this flag, it's usually a bad idea to write data concurrently to multiple event files (which it looks like you're doing, since both Jupyter notebooks are writing to the top-level log directory, and there are two events.out.tfevents.* files shown there). TensorBoard historically only reads new data from the last event file in each directory, which means it won't see new data added to the earlier file. Passing --reload_multifile=true should eliminate that possible pitfall.
--reload_multifile=true is not working; it's picking up only one event file. Is there anything else that needs to be done?
Can you provide a bit more detail on what you mean by "picking only one event file"? Also, can you provide the exact command you're using to launch TensorBoard and if possible the rest of the information in our diagnosis script, as requested in our issue template? https://github.com/tensorflow/tensorboard/blob/master/.github/ISSUE_TEMPLATE/bug_report.md
If you're able to provide a copy of the actual event files as well (e.g. as a .zip), that would also be helpful.
I wasted 4 hours on this trying to get it to work, to no avail, using tensorboard 2.3.0 and the SummaryWriter that is included in pytorch 1.3.1.
This is roughly how I do the logging (in my case all the logging happens not during but after training, which should not make a difference):
rm -rf /tmp/tb_logs/  # let's remove the old logs first
Python code:
from torch.utils.tensorboard import SummaryWriter

for identifier, result in trial_results.items():
    writer = SummaryWriter(log_dir=f"/tmp/tb_logs/{identifier}")
    for metric_name, val in result.items():  # not sure if this is necessary
        writer.add_scalar(metric_name, val)
    writer.add_hparams({k: v for k, v in trial_hparams[identifier].items()},
                       {f"hparams/{metric_name}": val for metric_name, val in result.items()})
    writer.close()
I can see that the results are stored:
$ find . -iname "events.out*" | wc -l
216
then I'm visualizing with:
tensorboard --logdir /tmp/tb_logs --reload_multifile=true
but I'm only seeing 5 results in the hparams overview. The scalars overview shows all results, not just the 5 in hparams.
here's a small extract of the data I'm using:
trial_results = {
    '0001_0008': {'AverageRecall_vehicle_Valid': 0.540647},
    '0001_0026': {'AverageRecall_vehicle_Valid': 0.535381},
}
trial_hparams = {
    '0001_0008': {'confLossFactr': 0.5, 'confIncByWeakDet': 0, 'confIncByStrongDet': 0, 'confPropIncDetThresh': 0.65},
    '0001_0026': {'confLossFactr': 0.7, 'confIncByWeakDet': 0, 'confIncByStrongDet': 0, 'confPropIncDetThresh': 0.325},
}
Ok, so this took me a while. I have a mix of old and new logs with different numbers of hparams. The additional hparams added later did not show up, even after restarting tensorboard multiple times. I had to delete all the logs and restart tensorboard to see all the hparams.
As you can see in my post above, I've deleted the log folder and am still only seeing a fraction of the hyperparameters.
Hey, I have the same issue. In my case, I'm writing sequentially into the same folder, i.e. in a separate run method after the training and retraining of my model.
def run1(run_dir, hparams, …):
    with tf.summary.create_file_writer(run_dir).as_default() as writer:
        hp.hparams(hparams)
        loss_fine, accuracy_fine = train_test_model(hparams, …)
        tf.summary.scalar(METRIC_ACCURACY, accuracy_fine, step=1)
        tf.summary.scalar(METRIC_LOSS, loss_fine, step=1)
        writer.close()
…
def run2(run_dir, hparams, …):
    with tf.summary.create_file_writer(run_dir).as_default() as writer:
        hp.hparams(hparams)
        loss_fine_tuning, accuracy_fine_tuning = train_test_model_fine_tuning(hparams, …)
        tf.summary.scalar(METRIC_ACCURACY_FINE_TUNING, accuracy_fine_tuning, step=1)
        tf.summary.scalar(METRIC_LOSS_FINE_TUNING, loss_fine_tuning, step=1)
        writer.close()
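Since both run methods above create a file writer on the same run_dir, each run leaves a second event file in that directory, which is exactly the situation the --reload_multifile note earlier in the thread warns about. One way around it (a sketch; the helper name and phase names are illustrative, not part of any TensorBoard API) is to give each logging phase its own subdirectory, so every event file lives alone:

```python
import os

def phase_log_dir(run_dir, phase):
    """One subdirectory per logging phase, so each event file lives
    alone and TensorBoard treats each phase as its own run."""
    return os.path.join(run_dir, phase)

# e.g. pass phase_log_dir(run_dir, "train") to the first writer
# and phase_log_dir(run_dir, "fine_tuning") to the second
print(phase_log_dir("logs/run-1", "fine_tuning"))
```

The trade-off is that the two phases then show up as separate runs in the dashboard rather than as one row.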
The resulting folder structure (screenshot from my Jupyter notebook):
same issue, any suggestions?
After recording experiments throughout the past few weeks I realized that I am having the same issue ... is there any chance I can get the desired information without redoing the experiments (e.g. by looping over the summaries manually)?
That's definitely possible. Did you find a solution? If you're interested, I've also found a few good logging tools that don't have these issues; I think some of them can even read the serialized tensorboard logs.
My solution is to store each trial in its own folder. So, if I had 16 trials, there would be 16 folders.
hp_path_for_this_trial = os.path.join(hp_path, trial_id)
model.fit(
    x_train, y_train, epochs=1,
    callbacks=[
        tf.keras.callbacks.TensorBoard(tensorboard_path),  # log metrics
        hp.KerasCallback(hp_path_for_this_trial, hparams, trial_id=trial_id),  # log hparams
    ],
)
Same problem here. Can't show hparams generated in different sessions all together in tensorboard.
Ok, so this took me a while. I have a mix of old and new logs with different numbers of hparams. The additional hparams added later did not show up
This is very similar to what I have observed in https://github.com/tensorflow/tensorboard/issues/3597. In my case, I had a few None values for numeric variables, and these None values were encoded as strings. As a result, TB interpreted the variable as a string variable and did not even show the runs with numeric values.
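A defensive pattern that avoids this type coercion (a sketch; sanitize_hparams and the sentinel value are mine, not part of any TensorBoard API) is to replace None values with a numeric sentinel before logging, so TB never infers a string type for an otherwise numeric hyperparameter:

```python
def sanitize_hparams(hparams, sentinel=-1):
    """Replace None values with a numeric sentinel so a numeric
    hyperparameter column is not coerced to strings."""
    return {key: sentinel if value is None else value
            for key, value in hparams.items()}

clean = sanitize_hparams({"lr": 0.1, "dropout": None})
print(clean)  # {'lr': 0.1, 'dropout': -1}
```

Pick a sentinel that cannot occur as a real value for that hyperparameter, so the padded rows remain distinguishable in the table.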
Same issue:
"If I have 2 independent runs with different summary-writer instances and each run logs to a different directory (e.g. ~/001 and ~/002), then I can point tensorboard to each of the logdirs and see the full set of hyperparameters, respectively. Now I want to compare both runs in a single view, so I point tensorboard to the parent dir, namely ~/. If I check the hparams view again, I am left with only the intersection of both hyperparameter sets. All hyperparameters that are unique to one of the runs are not shown anymore."
Is there any reliable workaround or plan to fix this?
Any solution or fix so far? If not fixed, then the HParams feature is nearly useless.
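One workaround for the missing-hyperparameters problem described above (a sketch; pad_hparams and the placeholder are illustrative, not a TensorBoard API) is to log the same unified key set for every run, padding in a placeholder for keys a given run doesn't use, so no hyperparameter is unique to a single run:

```python
def pad_hparams(per_run_hparams, placeholder=0.0):
    """Give every run the same hparam keys by padding missing ones,
    so no hyperparameter is unique to a single run."""
    all_keys = set()
    for hparams in per_run_hparams.values():
        all_keys.update(hparams)
    return {run: {key: hparams.get(key, placeholder) for key in sorted(all_keys)}
            for run, hparams in per_run_hparams.items()}

padded = pad_hparams({"001": {"lr": 0.1},
                      "002": {"lr": 0.2, "dropout": 0.5}})
```

Choose a placeholder of the same type as the real values; a string placeholder would trigger the string-coercion issue mentioned in the earlier comment about None values.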
I created the attached script to extract the parameters and metrics for the scalars as csv-file.
TB logs extraction script
# -*- coding: utf-8 -*-
from __future__ import print_function
import argparse
import os
import sys
import traceback
from datetime import datetime

import numpy as np
import pandas as pd

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from tensorboard.backend.event_processing.event_file_inspector import get_inspection_units
from tensorboard.plugins.hparams import plugin_data_pb2


def parse_arguments():
    parser = argparse.ArgumentParser(description='Extract data from Tensorboard logs to pandas and save')
    parser.add_argument('--force', help='Forces extraction even if up-to-date file exists.', action='store_true')
    parser.add_argument('--dir', help='Log Folder', default='../logs')
    global args
    args = parser.parse_args()


def get_hparams(acc: EventAccumulator):
    """
    Loops over all events in the given directory and looks for hparams.
    :param acc: the EventAccumulator to read from
    :return: a dict of hparams (empty if none were found)
    """
    row = {}
    # get all hparams contained in the acc session_start_info, if there are any
    if '_hparams_/session_start_info' in acc.summary_metadata:
        data = acc.SummaryMetadata('_hparams_/session_start_info').plugin_data.content
        pdata = plugin_data_pb2.HParamsPluginData.FromString(data)
        if pdata.session_start_info.hparams:
            for k in pdata.session_start_info.hparams.keys():
                row[k] = eval(str(pdata.session_start_info.hparams[k]).split(':')[1].strip().capitalize())
    return row


def main():
    root_dir = args.dir
    output_file = 'tb_data.csv'
    if not os.path.isdir(root_dir):
        print(root_dir + " does not exist!")
        return
    tb_data_filename = os.path.join(root_dir, output_file)
    if os.path.isfile(tb_data_filename):
        df_tb_data = pd.read_csv(tb_data_filename, index_col=[0])
    else:
        df_tb_data = pd.DataFrame()
    print(f"Get inspection units from {root_dir}")
    inspect_units = get_inspection_units(logdir=root_dir)
    if not inspect_units:
        print("No inspection units found in " + root_dir)
        return
    _extracted_tags = []
    for run in inspect_units:
        path = os.path.relpath(run.name)
        if path not in df_tb_data.index or args.force:
            output = {path: {}}
            try:
                acc = EventAccumulator(run.name)
                acc.Reload()
                _hparams = get_hparams(acc)
                # Skip if no hparams or no scalars
                if not _hparams or len(acc.Tags()['scalars']) == 0:
                    continue
                output[path] = _hparams
                output[path]['date'] = datetime.fromtimestamp(acc.FirstEventTimestamp())
                # Some tags are logged for each training sample.
                # Therefore the tags relevant for the total epochs are hardcoded here.
                tags_for_epoch_count = ['epoch']
                output[path]['total epochs'] = max(
                    [acc.Scalars(tag)[-1].value for tag in tags_for_epoch_count])
                last_timestamp = acc.FirstEventTimestamp()
                for tag in acc.Tags()['scalars']:
                    scalar = acc.Scalars(tag)
                    last_timestamp = max(last_timestamp, scalar[-1].wall_time)
                    _all_values = np.array([s.value for s in scalar][1:])
                    if len(_all_values):
                        output[path][tag + ' last'] = _all_values[-1]
                        output[path][tag + ' min'] = min(_all_values)
                        output[path][tag + ' min epoch'] = np.where(_all_values == min(_all_values))[0][0]
                    if tag not in _extracted_tags:
                        _extracted_tags.append(tag)
                output[path]['end_date'] = datetime.fromtimestamp(last_timestamp)
                # Add to the df and save it
                df_tb_data = pd.concat([df_tb_data, pd.DataFrame(output).T])
            except Exception:
                print("Error while parsing {}:".format(path))
                print(traceback.format_exc())
                # TODO: write erroneous paths to file
    if not df_tb_data.empty:
        df_tb_data.to_csv(tb_data_filename)
    print()
    print("All finished.")


if __name__ == '__main__':
    parse_arguments()
    sys.exit(main())
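Once the script has run, the resulting tb_data.csv can be post-processed with plain Python (a sketch; the column names below come from the example trial data earlier in the thread and will differ for your own tags, and the inline CSV is a stand-in for the real file):

```python
import csv
import io

# a tiny stand-in for the tb_data.csv the script writes
rows = list(csv.DictReader(io.StringIO(
    "run,confLossFactr,AverageRecall_vehicle_Valid last\n"
    "0001_0008,0.5,0.540647\n"
    "0001_0026,0.7,0.535381\n"
)))
# pick the trial with the best validation recall
best = max(rows, key=lambda r: float(r["AverageRecall_vehicle_Valid last"]))
print(best["run"])  # 0001_0008
```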
Hello folks, I'm sorry this has been an issue for you. I can't promise that we will be able to prioritize this in the near future, but I stumbled upon this after the recent comment, and just wanted to add a note here:
If anybody running into this issue was able to provide the diagnostics that @nfelt mentioned in a previous comment, and/or a way to reproduce with some source script and data or at least some event files, it could make it easier for us to look at it and find the issue. =]
From what I see, (at least some) users are using the SummaryWriter implementation from pytorch. There's some chance it's an issue with that implementation.
a way to reproduce with some source script
From https://github.com/tensorflow/tensorboard/issues/3597#issuecomment-1490793918:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

def run(param):
    with tf.summary.create_file_writer("logs/param_" + str(param)).as_default():
        hp.hparams({"param": param})
        tf.summary.scalar("metric", 0, step=0)

run(0)
run(1)
run("str")
I would be very interested in this being fixed too! Being able to compare models using slightly different sets of hyperparameters is very important in my experiments.