
Tensorboard metrics not loaded properly

Open timr1101 opened this issue 1 year ago • 11 comments

I have the problem that in Tensorboard the metrics are not loaded correctly (the column is always empty), although the scalars are saved correctly. I am working with torch.utils.tensorboard.

[Screenshot: HPARAMS tab with an empty metrics column]

Relevant code:

```python
writer = SummaryWriter(log_dir=f'./logs/studies/{study_name}/')

# In the training loop:
writer.add_scalar(tag='validation/min_loss', scalar_value=min_val_loss, global_step=trial.number)

# Add the hyperparameters to the summary writer
# (args_dict is a dictionary with all hyperparameters):
writer.add_hparams(hparam_dict=args_dict, metric_dict={'validation/min_loss': min_val_loss}, run_name=run_name)
writer.close()
```

timr1101 avatar Sep 06 '24 22:09 timr1101

Are the metrics showing up in the Time Series or Scalar tabs? Did you try selecting the "show metrics" check boxes?

JamesHollyer avatar Sep 16 '24 18:09 JamesHollyer

The scalars associated with the metrics are loaded correctly in both the TIME SERIES and SCALARS tabs. The only problem is that no metrics are displayed in the HPARAMS tab. When I select the "show metrics" checkboxes, a completely empty chart pops up.

timr1101 avatar Sep 18 '24 09:09 timr1101

Wow that is strange! I do not see why that would happen and I cannot seem to reproduce it. Is this happening with other logs or just this one?

JamesHollyer avatar Sep 18 '24 16:09 JamesHollyer

Yes, it's weird. It doesn't seem to be a problem with these specific logs only. I've also used other scalars as metrics, but that didn't change the result. It is perhaps also noteworthy that I encountered exactly the same problem with a completely different implementation, namely the code from the official guide to hyperparameter tuning with TensorBoard (a TensorFlow implementation). The scalars were displayed correctly in the TIME SERIES and SCALARS tabs, but the column of the corresponding metric "Accuracy" in the HPARAMS tab remained empty.

[Screenshot: HPARAMS tab with an empty Accuracy column]

Related code (from the official guide):


```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp


fashion_mnist = tf.keras.datasets.fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))

METRIC_ACCURACY = 'accuracy'

with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
  hp.hparams_config(
    hparams=[HP_NUM_UNITS, HP_DROPOUT, HP_OPTIMIZER],
    metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
  )

def train_test_model(hparams):
  model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=tf.nn.relu),
    tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),
  ])
  model.compile(
      optimizer=hparams[HP_OPTIMIZER],
      loss='sparse_categorical_crossentropy',
      metrics=['accuracy'],
  )

  model.fit(x_train, y_train, epochs=1)  # Run with 1 epoch to speed things up for demo purposes
  _, accuracy = model.evaluate(x_test, y_test)
  return accuracy

def run(run_dir, hparams):
  with tf.summary.create_file_writer(run_dir).as_default():
    hp.hparams(hparams)  # record the values used in this trial
    accuracy = train_test_model(hparams)
    tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)


session_num = 0

for num_units in HP_NUM_UNITS.domain.values:
  for dropout_rate in (HP_DROPOUT.domain.min_value, HP_DROPOUT.domain.max_value):
    for optimizer in HP_OPTIMIZER.domain.values:
      hparams = {
          HP_NUM_UNITS: num_units,
          HP_DROPOUT: dropout_rate,
          HP_OPTIMIZER: optimizer,
      }
      run_name = "run-%d" % session_num
      print('--- Starting trial: %s' % run_name)
      print({h.name: hparams[h] for h in hparams})
      run('logs/hparam_tuning/' + run_name, hparams)
      session_num += 1
```

timr1101 avatar Sep 18 '24 21:09 timr1101

Is it possible for you to send me your log files?

JamesHollyer avatar Sep 19 '24 20:09 JamesHollyer

Sure. But since I'm currently on vacation, I can't do this until the beginning of next week.

timr1101 avatar Sep 19 '24 21:09 timr1101

Hey Tim, thanks for sending me your logs. Unfortunately, I still cannot reproduce the issue. I ran these commands:

```shell
pip install --upgrade pip
pip install tensorboard
tensorboard --logdir ./your/log/dir
```

[Screenshot: HPARAMS tab displaying the metrics correctly]

What version of TensorBoard are you running?

```shell
$ pip freeze | grep tensorboard
tensorboard==2.8.0
tensorboard-data-server==0.6.1
```
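When comparing environments like this, it can also help to query the installed versions programmatically instead of via `pip freeze`. A minimal stdlib sketch (using `importlib.metadata`, Python 3.8+; `tb_versions` is a hypothetical helper name):

```python
from importlib import metadata

def tb_versions():
    # Return the installed versions of the tensorboard packages,
    # or None for any package that is not installed.
    versions = {}
    for pkg in ("tensorboard", "tensorboard-data-server"):
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions
```

This is handy when several virtual environments are in play and `pip` on the PATH may not match the interpreter running TensorBoard.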

JamesHollyer avatar Sep 25 '24 16:09 JamesHollyer

Hey James, the tensorboard versions were indeed the deciding factor. I had the newer versions

```
tensorboard==2.17.1
tensorboard-data-server==0.7.2
```

installed. Downgrading to

```
tensorboard==2.8.0
tensorboard-data-server==0.6.1
```

solved the problem and all metrics were displayed correctly. Thank you very much for your help! One more note: I also installed the version released today,

```
tensorboard==2.18.0
tensorboard-data-server==0.7.2
```

but the problem still exists there as well.

timr1101 avatar Sep 26 '24 00:09 timr1101

Thanks! Had the same issue here. Works for me with tensorboard==2.16.2 (and tensorboard-data-server==0.7.2).

lebeand avatar Nov 26 '24 23:11 lebeand

Can this issue be re-opened? It still persists with the current version 2.18.0. It works for me in 2.16.2, but not in 2.17.0.

kevinunger avatar Dec 17 '24 23:12 kevinunger

I'm also having the same issue with tensorboard==2.18.0 and tensorboard-data-server==0.7.2.

ArchieLuxton avatar Feb 05 '25 14:02 ArchieLuxton

I have the same issue in tensorboard==2.19.0 Is anyone working on the fix?

itsmohitanand avatar Apr 04 '25 13:04 itsmohitanand

Are the people experiencing this writing their data with a library other than TensorFlow (e.g. PyTorch)?

I ran the script provided above with TB 2.19.0, and I can see the Accuracy metric being displayed as expected.

Since somebody reported this was reproducible starting with 2.17.0, I suspected #6822 (from the 2.17 release notes) could have something to do with it, depending on whether the files with the data are directly in the logdir specified in the CLI for the TB command, but I was not able to reproduce the issue.
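For anyone who wants to check whether their event files sit directly in the top-level logdir (the layout the 2.17 logdir-handling change could affect), a small stdlib walk is enough. This is a hypothetical diagnostic sketch, not part of TensorBoard:

```python
import os

def find_event_files(logdir):
    # Collect tfevents files under logdir and separately report any
    # that live directly in the top-level directory rather than in
    # per-run subdirectories.
    hits = []
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            if "tfevents" in name:
                hits.append(os.path.join(root, name))
    top_level = [p for p in hits if os.path.dirname(p) == os.path.normpath(logdir)]
    return hits, top_level
```

If `top_level` is non-empty, your runs are written into the logdir itself rather than into subdirectories, which is worth mentioning when reporting a reproduction.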

arcra avatar Apr 16 '25 23:04 arcra

I still experience this on TB 2.19.0. I write my data like this:

```python
from torch.utils.tensorboard import SummaryWriter

w = SummaryWriter("hparams/results")
w.add_hparams(
    {"lr": 1e-3, "batch_size": 4096},
    {"episode_return_mean": 123.4},
)
w.close()
```

Using torch==2.8.0 and TB==2.19.0, I do not get any values in my episode_return_mean column in the HPARAMS tab of TensorBoard. Downgrading TB to 2.16.2 fixes it, but then I must downgrade protobuf as well, which I cannot do due to other dependencies.

dlindmark avatar May 07 '25 07:05 dlindmark

I copied all the code from the hparams demo Colab and ran it on my TB 2.19, then encountered this problem. However, it works fine on Colab (TB 2.18).

0523ronli avatar May 16 '25 17:05 0523ronli

I encountered the same issue on TB 2.19.0. I tried several versions:

- TB 2.19.0: display issues
- TB 2.18.0: display issues
- TB 2.17.1: display issues
- TB 2.17.0: display issues
- TB 2.16.2: normal display

Downgrading TensorBoard to 2.16.2 resolved the issue; at least TB is functional now. I'm not sure how to fix or further troubleshoot this, but please let me know if you need more information.
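The versions tested above bracket the regression: 2.16.2 renders the metrics, 2.17.0 and later do not. A tiny helper to flag an affected install could look like this; it is hypothetical and based only on the versions reported in this thread:

```python
def hparams_metrics_affected(version):
    # Per the reports in this thread, the empty HPARAMS metrics
    # column appears from TensorBoard 2.17.0 onward.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 17)
```

Comparing `(major, minor)` tuples avoids the classic string-comparison pitfall where "2.8" would sort after "2.17".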

XuRainbow avatar Jun 04 '25 07:06 XuRainbow

I am also experiencing this same issue. My environment is Python 3.12 through Miniconda on Windows 10 Pro 19045, with Tensorboard 2.19.0. Note that I do not have any issue when running Tensorboard from Debian 12, also with Tensorboard 2.19.0. Even more surprisingly, mounting my Windows directory into WSL Ubuntu 22 and running tensorboard from there also works.

At least in my case, this therefore seems to be a Windows-specific issue.

This seems to be purely a visualization issue, not a logging one, because I can log the data with Windows Python and visualize it without issue through WSL when mounting the Windows-generated tensorboard logs. This behavior was tested with both TensorFlow logging and PyTorch torch.utils.tensorboard logging.
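One way to confirm that the Windows-written logs themselves are intact, independent of any TensorBoard install, is to walk the TFRecord framing used by tfevents files (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). This sketch only counts records and deliberately skips CRC validation; `count_tfrecords` is a hypothetical helper:

```python
import struct

def count_tfrecords(path):
    # Count records in a TFRecord-framed file (the framing used by
    # .tfevents files). Each record is: uint64 LE length, 4-byte
    # masked CRC of the length, payload, 4-byte masked CRC of the
    # payload. CRCs are not validated here.
    n = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            (length,) = struct.unpack("<Q", header)
            f.read(4)  # masked CRC of the length field
            payload = f.read(length)
            if len(payload) < length:
                break  # truncated file
            f.read(4)  # masked CRC of the payload
            n += 1
    return n
```

A non-zero count on the Windows side would support the conclusion that logging works and only the rendering is broken.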

@timr1101 @lebeand @kevinunger @dlindmark @0523ronli @XuRainbow Can you confirm if you were using Windows when encountering the issue?

shanzhaii avatar Jul 01 '25 22:07 shanzhaii