Tensorboard Not Displaying

Open lapp0 opened this issue 1 year ago • 11 comments

Describe the bug

For the past few days the Hub has been unable to render any TensorBoards. Instead it displays the following message:

No dashboards are active for the current data set.

Probable causes:

You haven’t written any data to your event files.
TensorBoard can’t find your event files.

I ran the exact same training script with report_to="tensorboard", changing only hub_repo_id to a new repo. The previous run from a week ago renders a tensorboard page; the new repo doesn't.

Reproduction

https://huggingface.co/models?library=tensorboard&sort=created

I see tensorboards working from ~3 days ago, but none render in the last 2 days.

Sorry if this is the wrong repo for a huggingface.co issue. If so, please let me know where I should submit it instead.

lapp0 avatar Mar 31 '24 19:03 lapp0

cc: @severo

julien-c avatar Apr 02 '24 07:04 julien-c

Thanks for reporting @lapp0! This should be fixed now. Could you retry on your model and close this issue if appropriate? Thanks!

Wauplin avatar Apr 02 '24 15:04 Wauplin

It's working now! Thanks for fixing quickly @Wauplin, great job!

lapp0 avatar Apr 02 '24 16:04 lapp0

Glad your problem's solved :) Kudos goes to @XciD @severo!

Wauplin avatar Apr 02 '24 16:04 Wauplin

I'm still seeing this issue transiently, is it possible it was reintroduced?

nickcharles avatar Jun 20 '24 04:06 nickcharles

Hi, I'm seeing this issue again @XciD @severo

It seems that

  • Tensorboards aren't being updated when new logs are pushed, as of ~48 hours ago. Example
  • ~~all other Tensorboards hang with a "Loading Tensorboard" message. Example: A model from 2022~~

Any chance we could get Tensorboard on https://status.huggingface.co/ ?

Edit:

Tensorboards no longer fail to start, but they're still not being updated with log files from the last ~48 hours.
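To check which event files fall inside that ~48 hour window, one option is to read the timestamp embedded in each file name. This is a minimal sketch, assuming the usual TensorBoard naming scheme `events.out.tfevents.<unix_timestamp>.<hostname>...`; the helper names `event_file_timestamp` and `recent_event_files` are hypothetical and not part of any library.

```python
import re
import time

# Assumption: event files follow the usual TensorBoard naming scheme
# "events.out.tfevents.<unix_timestamp>.<hostname>...".
_TFEVENTS_RE = re.compile(r"tfevents\.(\d+)\.")


def event_file_timestamp(path):
    """Return the Unix timestamp embedded in an event file name, or None."""
    match = _TFEVENTS_RE.search(path)
    return int(match.group(1)) if match else None


def recent_event_files(paths, max_age_hours=48, now=None):
    """Return the event files whose embedded timestamp is within max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    recent = []
    for path in paths:
        ts = event_file_timestamp(path)
        # Skip files that do not look like event files at all
        if ts is not None and ts >= cutoff:
            recent.append(path)
    return recent
```

Passing the output of `list_repo_files(repo_id)` as `paths` would list the runs that should be visible on the Hub but may be missing from the rendered board.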

lapp0 avatar Sep 02 '24 15:09 lapp0

cc @XciD ^

julien-c avatar Sep 02 '24 18:09 julien-c

Appears to be resolved, thanks!

lapp0 avatar Sep 04 '24 04:09 lapp0

Re-occurring.

lapp0 avatar Sep 04 '24 14:09 lapp0

Here's the hacky script I'm using to render a hub tensorboard locally.

python3 run.py distily/distily_dataset_sweep

It retrieves all tfevent files from a repo, puts them in a temporary directory, and starts a tensorboard locally under that directory.

import os
import tempfile
from huggingface_hub import list_repo_files, hf_hub_download
from tensorboard import program
import time
import sys


def download_tensorboard_files(model_repo_id, temp_dir):
    # List all files in the Hugging Face Hub model repository
    repo_files = list_repo_files(model_repo_id)

    # Filter out only tensorboard event files (those containing "tfevents" in their name)
    tb_files = [f for f in repo_files if 'tfevents' in f]

    if not tb_files:
        print("No tensorboard files found in the repository.")
        return []

    for tb_file in tb_files:
        # hf_hub_download recreates the repo's directory layout under local_dir,
        # so passing temp_dir directly avoids nesting each subdirectory twice
        file_path = hf_hub_download(repo_id=model_repo_id, filename=tb_file, local_dir=temp_dir)
        print(f"Downloaded: {file_path}")

    return temp_dir


def run_tensorboard(log_dir):
    tb = program.TensorBoard()

    # Start TensorBoard pointing to the log_dir
    tb.configure(argv=[None, '--logdir', log_dir])
    url = tb.launch()
    print(f"TensorBoard is running at: {url}")

    # Infinite loop to keep the script running
    try:
        while True:
            time.sleep(60)
    except KeyboardInterrupt:
        print("TensorBoard process terminated.")


def main(model_repo_id):
    # Create a temporary directory
    temp_dir = tempfile.mkdtemp()

    try:
        # Download TensorBoard files to the temporary directory
        log_dir = download_tensorboard_files(model_repo_id, temp_dir)

        if log_dir:
            # Run TensorBoard on the downloaded log files
            run_tensorboard(log_dir)
        else:
            print("No TensorBoard logs to visualize.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # Optionally, clean up the temporary directory after use
        # (uncommenting the next line also requires `import shutil` at the top)
        # shutil.rmtree(temp_dir)
        pass


if __name__ == "__main__":
    # Replace with your Hugging Face Hub model ID
    model_repo_id = sys.argv[1]
    main(model_repo_id)
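
A shorter variant of the download step, as a sketch: `snapshot_download` accepts `allow_patterns`, so the filtering and the per-file loop can be left to the library. The helper names `select_tfevent_files` and `download_event_files` are illustrative and not part of the script above, and `huggingface_hub` is assumed to be installed.

```python
import fnmatch
import tempfile

# Glob pattern for TensorBoard event files anywhere in the repo tree.
TFEVENTS_PATTERN = "*tfevents*"


def select_tfevent_files(repo_files):
    """Return only the paths that look like TensorBoard event files."""
    return [f for f in repo_files if fnmatch.fnmatchcase(f, TFEVENTS_PATTERN)]


def download_event_files(repo_id):
    """Download just the event files of a Hub repo into a fresh temp dir."""
    # Imported lazily so the filtering helper above works without the package.
    from huggingface_hub import snapshot_download

    local_dir = tempfile.mkdtemp()
    # allow_patterns restricts the snapshot to matching files only;
    # the repo's directory layout is preserved under local_dir.
    snapshot_download(
        repo_id=repo_id,
        allow_patterns=[TFEVENTS_PATTERN],
        local_dir=local_dir,
    )
    return local_dir
```

The returned directory can then be passed to `run_tensorboard` from the script above, or to `tensorboard --logdir <dir>` on the command line.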

lapp0 avatar Sep 08 '24 17:09 lapp0

Error re-occurring starting some time in the past few days.

lapp0 avatar Oct 22 '24 01:10 lapp0