
Tensorboard not displaying all the HParams events

yashmanuda opened this issue 4 years ago • 19 comments

This is my logs directory structure (screenshot):

There are two Jupyter notebooks running in parallel with exactly the same code except for the prefix of the run-* directory. Both dump hparam_tuning metrics into the same directory, and both train the same model with the same hyperparameters and metrics, but on different data. My requirement is to view all of these runs in the same table in TensorBoard.

EDIT NOTE: The training data is generated and processed ONLY once, in one notebook, for all the runs; I cannot read and process the data multiple times for different runs. Also, the two notebooks run on different GPUs and I want to run them in parallel, which is why I cannot combine them into one notebook, where the runs would be sequential.

TensorBoard reads only 9 of the 18 hparam runs that I have in my logs directory (screenshot):

However, I am able to see the scalars that I use to monitor the loss, which live in the same log directory. Moreover, the metrics for hyperparameter tuning are visible for all the runs under the "Scalars" tab, just not under the HPARAMS tab.


EDIT NOTE: I have kept the identifier as another hyperparameter so that I can tell apart the hparam logs generated by the two Jupyter notebooks. It is only there for filtering purposes, since there is no feature to filter runs via trialId.
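For reference, the logging inside each run looks roughly like this (a minimal sketch rather than my exact code; run_dir, train_test_model, and the 'accuracy' tag are placeholders):

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

def run(run_dir, hparams):
    # Each session writes its hparams and its metric into its own run directory.
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the hyperparameter values for this trial
        accuracy = train_test_model(hparams)  # placeholder for the actual training
        tf.summary.scalar('accuracy', accuracy, step=1)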

Is there any way I can merge the results of different session runs into one table?

yashmanuda avatar Nov 17 '19 07:11 yashmanuda

Can you try passing the new --reload_multifile=true flag to TensorBoard and see if that addresses the issue? You'll need version 2.0.0 or greater.

Without this flag, it's usually a bad idea to write data concurrently to multiple event files (which it looks like you're doing since both Jupyter notebooks are writing to the top-level log directory, and there are two events.out.tfevents.* files shown there). TensorBoard historically only reads new data from the last event file in each directory, which means it won't see new data added to the earlier file. Passing --reload_multifile=true should eliminate that possible pitfall.

nfelt avatar Nov 19 '19 06:11 nfelt

--reload_multifile=true is not working. It's picking only one event file. Is there anything else that needs to be done?

yashmanuda avatar Nov 20 '19 08:11 yashmanuda

Can you provide a bit more detail on what you mean by "picking only one event file"? Also, can you provide the exact command you're using to launch TensorBoard and if possible the rest of the information in our diagnosis script, as requested in our issue template? https://github.com/tensorflow/tensorboard/blob/master/.github/ISSUE_TEMPLATE/bug_report.md

If you're able to provide a copy of the actual event files as well (e.g. as a .zip), that would also be helpful.

nfelt avatar Nov 21 '19 22:11 nfelt

I wasted 4 hours trying to get this to work, to no avail, using TensorBoard 2.3.0 and the SummaryWriter that is included in PyTorch 1.3.1.

This is roughly how I do the logging (in my case, all the logging happens after training rather than during it, which should not make a difference):

rm -rf /tmp/tb_logs/  # let's remove the old logs first

python code:

from torch.utils.tensorboard import SummaryWriter
for identifier, result in trial_results.items(): 
    writer = SummaryWriter(log_dir=f"/tmp/tb_logs/{identifier}")
    for metric_name, val in result.items():  # not sure if this is necessary
        writer.add_scalar(metric_name, val)
    writer.add_hparams({k: v for k, v in trial_hparams[identifier].items()},
                       {f"hparams/{metric_name}" : val for metric_name, val in result.items()})
    writer.close()

I can see that the results are stored:

$ find . -iname "events.out*" | wc -l
216

Then I'm visualizing with: tensorboard --logdir /tmp/tb_logs --reload_multifile=true

But I'm only seeing 5 results in the hparams overview; the scalars overview shows all results, not just the 5 that appear in hparams.

here's a small extract of the data I'm using:

trial_results = {
  '0001_0008': {'AverageRecall_vehicle_Valid': 0.540647},
  '0001_0026': {'AverageRecall_vehicle_Valid': 0.535381}
}
trial_hparams = {
  '0001_0008': {'confLossFactr': 0.5, 'confIncByWeakDet': 0, 'confIncByStrongDet': 0, 'confPropIncDetThresh': 0.65}, 
  '0001_0026': {'confLossFactr': 0.7, 'confIncByWeakDet': 0, 'confIncByStrongDet': 0, 'confPropIncDetThresh': 0.325}
}

dominikdienlin avatar Oct 15 '20 13:10 dominikdienlin

OK, so this took me a while. I have a mix of old and new logs with different numbers of hparams. The additional hparams added later did not show up, even after restarting TensorBoard multiple times. I had to delete all the logs and restart TensorBoard to see all the hparams.

lkhphuc avatar Oct 16 '20 15:10 lkhphuc

As you can see in my post above, I've deleted the log folder and am still only seeing a fraction of the hyperparameters.

dominikandreas avatar Oct 16 '20 15:10 dominikandreas

Hey, I have the same issue. In my case, I'm writing sequentially into the same folder, i.e. in a separate run method after the training and retraining of my model.

def run1(run_dir, hparams, …):
    with tf.summary.create_file_writer(run_dir).as_default() as writer:
        hp.hparams(hparams)
        loss_fine, accuracy_fine = train_test_model(hparams, …)
        tf.summary.scalar(METRIC_ACCURACY, accuracy_fine, step=1)
        tf.summary.scalar(METRIC_LOSS, loss_fine, step=1)
    writer.close()

…

def run2(run_dir, hparams, …):
    with tf.summary.create_file_writer(run_dir).as_default() as writer:
        hp.hparams(hparams)
        loss_fine_tuning, accuracy_fine_tuning = train_test_model_fine_tuning(hparams, …)
        tf.summary.scalar(METRIC_ACCURACY_FINE_TUNING, accuracy_fine_tuning, step=1)
        tf.summary.scalar(METRIC_LOSS_FINE_TUNING, loss_fine_tuning, step=1)
    writer.close()

The resulting folder structure (screenshot from the Jupyter notebook).

DanielDevito82 avatar Jan 25 '21 22:01 DanielDevito82

same issue, any suggestions?

HoltSpalding avatar Apr 02 '21 15:04 HoltSpalding

After recording experiments throughout the past few weeks, I realized that I am having the same issue ... is there any chance that I can get the desired information without redoing the experiments? (e.g. by looping over the summaries manually?)

FranzKnut avatar May 04 '21 16:05 FranzKnut

After recording experiments throughout the past few weeks, I realized that I am having the same issue ... is there any chance that I can get the desired information without redoing the experiments? (e.g. by looping over the summaries manually?)

That's definitely possible. Did you find your solution? I've also found a few great logging solutions that don't have these issues, if you're interested; some of them, I think, can even read the serialized TensorBoard logs.

HoltSpalding avatar May 06 '21 13:05 HoltSpalding

My solution is to store each trial in its own folder. So, if I had 16 trials, there would be 16 folders.

import os

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

# One subdirectory per trial, so every hparams session gets its own run.
hp_path_for_this_trial = os.path.join(hp_path, trial_id)

model.fit(
    x_train, y_train, epochs=1,
    callbacks=[
        tf.keras.callbacks.TensorBoard(tensorboard_path),  # log metrics
        hp.KerasCallback(hp_path_for_this_trial, hparams, trial_id=trial_id),  # log hparams
    ]
)
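If the table is still missing columns after that, it might also help to register the full set of hyperparameters and metrics once in the top-level hparams directory, so the dashboard knows every column up front. A sketch, assuming the hp_path from above; the HParam and Metric definitions are just examples you would replace with your own:

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

# Declare the experiment configuration once, in the root hparams log directory.
with tf.summary.create_file_writer(hp_path).as_default():
    hp.hparams_config(
        hparams=[
            hp.HParam('num_units', hp.Discrete([16, 32])),
            hp.HParam('dropout', hp.RealInterval(0.1, 0.5)),
        ],
        metrics=[hp.Metric('epoch_accuracy', display_name='Accuracy')],
    )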

blacksnail789521 avatar Jul 27 '21 12:07 blacksnail789521

Same problem here. Can't show hparams generated in different sessions all together in TensorBoard.

oconnor127 avatar Jan 25 '22 07:01 oconnor127

OK, so this took me a while. I have a mix of old and new logs with different numbers of hparams. The additional hparams added later did not show up

This is very similar to what I have observed in https://github.com/tensorflow/tensorboard/issues/3597. In my case, I had a few None values for numeric variables, and these None values were encoded as strings. As a result, TB interpreted this variable as a string variable and did not even show the runs with numeric values.
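If that is what is happening for you, a simple workaround is to keep each hyperparameter's type consistent across runs before logging, e.g. by replacing (or dropping) None values. A small sketch, not part of my original report:

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

def sanitize_hparams(hparams, sentinel=-1.0):
    # Replace None with a numeric sentinel so a numeric hyperparameter never
    # gets logged as a string for some runs; dropping the key is another option.
    return {k: (sentinel if v is None else v) for k, v in hparams.items()}

with tf.summary.create_file_writer("logs/param_none").as_default():
    hp.hparams(sanitize_hparams({"param": None}))
    tf.summary.scalar("metric", 0, step=0)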

bersbersbers avatar Feb 04 '22 22:02 bersbersbers

Same issue:

"If I have 2 independent runs with different summary writer instances and each run logs to a different directory (.e.g ~/001 and ~/002, then I can point tensorboard to each of the logdirs and see the full set of hyperparameters, respectively. Now I want to compare both runs in a single view, so I point tensorboard to the parent dir, namely ~/. If I check the hparams view again I am left with only the union of both hyperparameter sets. All hyperparameters that are unique to one of the runs are not shown anymore."

Is there any reliable workaround or plan to fix this?

LarsHill avatar Apr 09 '22 11:04 LarsHill

Any solution or fix so far? If not fixed, then the HParam function is nearly useless.

semaphore-egg avatar Jan 19 '23 14:01 semaphore-egg

After recording experiments throughout the past few weeks, I realized that I am having the same issue ... is there any chance that I can get the desired information without redoing the experiments? (e.g. by looping over the summaries manually?)

That's definitely possible. Did you find your solution? I've also found a few great logging solutions that don't have these issues, if you're interested; some of them, I think, can even read the serialized TensorBoard logs.

I created the attached script to extract the parameters and metrics for the scalars into a CSV file.

TB logs extraction script

# -*- coding: utf-8 -*-
from __future__ import print_function

import argparse
import os
import traceback
from datetime import datetime

import numpy as np
import pandas as pd
import sys
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from tensorboard.backend.event_processing.event_file_inspector import get_inspection_units
from tensorboard.plugins.hparams import plugin_data_pb2


def parse_arguments():
    parser = argparse.ArgumentParser(description='Extract data from Tensorboard logs to pandas and save')
    parser.add_argument('--force', help='Forces extraction even if up-to-date file exists.', action='store_true')
    parser.add_argument('--dir', help='Log Folder', default='../logs')
    global args
    args = parser.parse_args()


def get_hparams(acc: EventAccumulator):
    """
    Looks for hparams in the events loaded by the given EventAccumulator.
    :param acc: the EventAccumulator for one run directory
    :return: a dict mapping hparam names to values; empty if none were found
    """

    row = {}
    # get all hparams contained in the acc's session_start_info, if there are any
    if '_hparams_/session_start_info' in acc.summary_metadata:
        data = acc.SummaryMetadata('_hparams_/session_start_info').plugin_data.content
        pdata = plugin_data_pb2.HParamsPluginData.FromString(data)
        if pdata.session_start_info.hparams:
            for k in pdata.session_start_info.hparams.keys():
                row[k] = eval(str(pdata.session_start_info.hparams[k]).split(':')[1].strip().capitalize())

    return row


def main():
    root_dir = args.dir
    output_file = 'tb_data.csv'
    if not os.path.isdir(root_dir):
        print(root_dir + " does not exist!")
        return

    tb_data_filename = os.path.join(root_dir, output_file)
    if os.path.isfile(tb_data_filename):
        df_tb_data = pd.read_csv(tb_data_filename, index_col=[0])
    else:
        df_tb_data = pd.DataFrame()

    print(f"Get inspection units from {root_dir}")
    inspect_units = get_inspection_units(logdir=root_dir)

    if not inspect_units:
        print("No inspection units found in " + root_dir)
        return

    _extracted_tags = []
    for run in inspect_units:
        path = os.path.relpath(run.name)

        if path not in df_tb_data.index or args.force:

            output = {path: {}}

            try:
                acc = EventAccumulator(run.name)
                acc.Reload()
                _hparams = get_hparams(acc)

                # Skip if no hparams or no scalars
                if not _hparams or len(acc.Tags()['scalars']) == 0:
                    continue

                output[path] = _hparams
                output[path]['date'] = datetime.fromtimestamp(acc.FirstEventTimestamp())
                # Some tags are logged each training sample
                # Therefore the tags relevant for the total epochs are hardcoded here
                tags_for_epoch_count = ['epoch']
                output[path]['total epochs'] = max(
                    [acc.Scalars(tag)[-1].value for tag in tags_for_epoch_count])

                last_timestamp = acc.FirstEventTimestamp()
                for tag in acc.Tags()['scalars']:
                    scalar = acc.Scalars(tag)
                    last_timestamp = max(last_timestamp, scalar[-1].wall_time)
                    _all_values = np.array([s.value for s in scalar][1:])
                    if len(_all_values):
                        output[path][tag + ' last'] = _all_values[-1]
                        output[path][tag + ' min'] = min(_all_values)
                        output[path][tag + ' min epoch'] = np.where(_all_values == min(_all_values))[0][0]
                        if tag not in _extracted_tags:
                            _extracted_tags.append(tag)

                output[path]['end_date'] = datetime.fromtimestamp(last_timestamp)

                # Add to the df and save it
                df_tb_data = pd.concat([df_tb_data, pd.DataFrame(output).T])
            except Exception:
                print("Error while parsing {}:".format(path))
                print(traceback.format_exc())
                # TODO: write erroneous paths to a file

    if not df_tb_data.empty:
        df_tb_data.to_csv(tb_data_filename)

    print()
    print("All finished.")


if __name__ == '__main__':
    parse_arguments()
    sys.exit(main())
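
Assuming the script is saved as extract_tb_logs.py (the filename is arbitrary), it can be run against a log folder like this and will write tb_data.csv into that folder:

$ python extract_tb_logs.py --dir ../logs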

FranzKnut avatar Mar 24 '23 15:03 FranzKnut

Hello folks, I'm sorry this has been an issue for you. I can't promise that we will be able to prioritize this in the near future, but I stumbled upon this after the recent comment, and just wanted to add a note here:

If anybody running into this issue was able to provide the diagnostics that @nfelt mentioned in a previous comment, and/or a way to reproduce with some source script and data or at least some event files, it could make it easier for us to look at it and find the issue. =]

From what I see, (at least some) users are using the SummaryWriter implementation from PyTorch. There's some chance it's an issue with that implementation.

Can you provide a bit more detail on what you mean by "picking only one event file"? Also, can you provide the exact command you're using to launch TensorBoard and if possible the rest of the information in our diagnosis script, as requested in our issue template? https://github.com/tensorflow/tensorboard/blob/master/.github/ISSUE_TEMPLATE/bug_report.md

If you're able to provide a copy of the actual event files as well (e.g. as a .zip), that would also be helpful.

arcra avatar Mar 30 '23 02:03 arcra

a way to reproduce with some source script

From https://github.com/tensorflow/tensorboard/issues/3597#issuecomment-1490793918:

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

def run(param):
    with tf.summary.create_file_writer("logs/param_" + str(param)).as_default():
        hp.hparams({"param": param})
        tf.summary.scalar("metric", 0, step=0)

run(0)
run(1)
run("str")

bersbersbers avatar Mar 30 '23 19:03 bersbersbers

I would be very interested in this being fixed too! Being able to compare models using slightly different sets of hyperparameters is very important in my experiments.

leleogere avatar Jan 23 '24 10:01 leleogere