tensorboard icon indicating copy to clipboard operation
tensorboard copied to clipboard

Restarting training job doesn't visualize earlier log files

Open bhack opened this issue 11 months ago • 27 comments

I have a training job on a spot instance and a TB log on GCS-CSI fuse.

As it is a spot instance when the job it is going to reastart it going to create a new log file.

If I don't restart the TB the curves are ok but when I restart the TB service it going to read only then last log file.

Is there a ways to workaround this?

bhack avatar Mar 11 '24 14:03 bhack

How are you running TensorBoard in this situation? Are the log files created in the same directory? Are all steps recomputed in the new file, or are only newer steps added?

groszewn avatar Mar 11 '24 17:03 groszewn

Yes the log are created in the same directory. So I have the first log file and then a second log file that we the job restarting after the Google spot preemption. If I don't restart the TB all it is ok but when I restart it I see only the 2nd log. To avoid point overlapping in the curves I use purge_step using the last available checkpoint step.

bhack avatar Mar 11 '24 18:03 bhack

Have you tried specifying --reload_multifile?

groszewn avatar Mar 11 '24 20:03 groszewn

Yes I am using both --reload_multifile=true and --detect_file_replacement=true for the gcs csi fuse nature

bhack avatar Mar 11 '24 21:03 bhack

Can you please provide the output of tensorboard --inspect --logdir <your log directory> as well as the result of diagnose_tensorboard.py as specified in the issue template?

groszewn avatar Mar 12 '24 15:03 groszewn

tensorboard --inspect --logdir

With that command over the log directory where I have found the issue (there are 4 logs file there of the same run)

======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Traceback (most recent call last):
  File "/usr/local/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/main.py", line 41, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 278, in main
    return runner(self.flags) or 0
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 291, in _run_serve_subcommand
    efi.inspect(flags.logdir, event_file, flags.tag)
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 451, in inspect
    inspection_units = get_inspection_units(logdir, event_file, tag)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 405, in get_inspection_units
    field_to_obs=get_field_to_observations_map(generator, tag),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 212, in get_field_to_observations_map
    for event in generator:
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 244, in Load
    for record in super().Load():
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 178, in Load
    yield next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /logs/my_job10/log/tensorboard/events.out.tfevents.1710242212.mymodel.74.0; No such file or directory

And diagnose_tensorboard.py

//diagnose_tensorboard.py:32: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
  import pipes
### Diagnostics

<details>
<summary>Diagnostics output</summary>
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=11, micro=0, releaselevel='candidate', serial=1)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='tensorboard-2342343-4324', release='6.1.58+', version='#1 SMP PREEMPT_DYNAMIC Mon Jan 29 15:19:25 UTC 2024', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.16.2
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.16.2'

--- check: tensorflow_python_version
2024-03-12 17:52:11.636867: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: tensorflow.__version__: '2.16.1'
INFO: tensorflow.__git_version__: 'v2.16.1-0-g5bc9d26649c'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.11/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'tensorboard-5f94958cbc-w59k5'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=1590, st_dev=3145768, st_nlink=2, st_uid=0, st_gid=0, st_size=4096, st_atime=1710242647, st_mtime=1710242639, st_ctime=1710242639)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.11/dist-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.1.0
astunparse==1.6.3
blinker==1.4
certifi==2024.2.2
charset-normalizer==3.3.2
cryptography==3.4.8
dbus-python==1.2.18
distro==1.7.0
dm-tree==0.1.8
flatbuffers==23.5.26
gast==0.5.4
google-pasta==0.2.0
grpcio==1.62.0
h5py==3.10.0
httplib2==0.20.2
idna==3.6
importlib-metadata==4.6.4
jeepney==0.7.1
keras==3.0.5
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
libclang==16.0.6
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
ml-dtypes==0.3.2
more-itertools==8.10.0
namex==0.0.7
numpy==1.26.4
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.2
pip==24.0
protobuf==4.25.3
Pygments==2.17.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
python-apt==2.4.0+ubuntu3
requests==2.31.0
rich==13.7.0
SecretStorage==3.3.1
setuptools==69.1.1
six==1.16.0
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow-cpu==2.16.1
tensorflow-io-gcs-filesystem==0.36.0
termcolor==2.4.0
typing_extensions==4.9.0
urllib3==2.2.1
wadllib==1.3.6
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
zipp==1.0.0

</details>

bhack avatar Mar 12 '24 17:03 bhack

The output of tensorboard --inspect --logdir looks like you actually passed it tensorboard --inspect --event_file <file> instead. There should not be an error, I would expect either No event files found within logdir <your logdir> or

Found event files in:
<logdir>/<run>/...
<logdir>/<another run>/...
...
These tags are in <logdir>...
...

followed by event statistics.

Can you please attempt a rerun of tensorboard --inspect --logdir <your log directory> pointed at your log directory and resend? Otherwise, I'm inclined to suspect that there has been some corruption of your log directory.

groszewn avatar Mar 12 '24 18:03 groszewn

Ok this is on one of the runs where I found the issue (this one has 4 logs files in the run):

2024-03-12 19:30:25.911301: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Found event files in:
/logs/<my-run>/log/tensorboard

These tags are in /logs/<myrun>/log/tensorboard:
audio -
histograms -
images
  Image1
scalars
   LR
tensor -
======================================================================

Event statistics for /logs/<myrun>/log/tensorboard:
audio -
graph -
histograms -
images
   first_step           0
   last_step            21050
   max_step             21050
   min_step             0
   num_steps            422
   outoforder_steps     [(5600, 5000), (12450, 12000), (12200, 12000)]
scalars
   first_step           0
   last_step            21050
   max_step             21050
   min_step             0
   num_steps            422
   outoforder_steps     [(5600, 5000), (12450, 12000), (12200, 12000)]
sessionlog:checkpoint -
sessionlog:start
   outoforder_steps     []
   steps                [0, 5000, 12000, 12000]
sessionlog:stop -
tensor -
======================================================================

If you see instead scalars and images are starting from the last log (5k) after TB restart immagine

P.s. I've renamed/removed few things for anonymization.

bhack avatar Mar 12 '24 19:03 bhack

Would you mind running with --purge_orphaned_data=False set and see if that helps? It doesn't seem like your situation is the same, but this may help us to debug (doc link).

groszewn avatar Mar 12 '24 20:03 groszewn

--purge_orphaned_data=False

It doesn't solve it seems to me that it is always restarting the scalar graphs and the images from the last log file only. When I don't restart the service all it is ok on the web frontend.

bhack avatar Mar 12 '24 20:03 bhack

Are you able to copy the log directory locally and confirm that things work as expect? Would like to rule out whether this is something internal to TB or isolated to just reading from GCS.

groszewn avatar Mar 12 '24 21:03 groszewn

I've copied the same folder locally and it seems to work correctly. What I am using on Google cloud is https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver

bhack avatar Mar 12 '24 22:03 bhack

Do things work as expected when pointing to GCS directly instead of using the locally mounted path?

groszewn avatar Mar 13 '24 00:03 groszewn

Same bucket passing gcs:// it seems to work correctly

bhack avatar Mar 13 '24 01:03 bhack

Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.

One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec to see if that helps?

groszewn avatar Mar 13 '24 14:03 groszewn

Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.

One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec to see if that helps?

groszewn avatar Mar 13 '24 14:03 groszewn

Ok I will try. In the meantime are you relying, with TB internals, on some features that conflict with fuse limitation expecting that that path is a standard POSIX filesystem path?

See: https://cloud.google.com/storage/docs/gcs-fuse?hl=en#differences-and-limitations

bhack avatar Mar 13 '24 14:03 bhack

TB itself (when using the python filesystem readers) relies on TensorFlow's filesystem API via tf.io.gfile when TF is installed (and falls back to a stub implementation if TF is not available which does not have as much support), see #5286.

When using RustBoard for file reading (which is the default, you can try disabling with --load_fast=False to see if the python reader itself works in this situation), it looks like we use the Rust standard library path and walkdir (rustboard disk logdir source). I am not aware of any limitations of those implementations.

groszewn avatar Mar 13 '24 14:03 groszewn

@songjiaxun Any opinion about this on the CSI gcsfuse POSIX limitations?

bhack avatar Mar 13 '24 14:03 bhack

@groszewn I've isolated the bug. It is going to happen when I use --logdir_spec with multiple logs directory.

The first one in the list is always loaded correctly after the TB restart. All the other runs are going to load only the last log file. I've verified this removing one by one:

  • as the log it is going to become the first one in the list it is loaded completely the other ones just the last log.

Now the problem is that on the bucket it is common to have a lot of subfolders/runs with TB logs files and I cannot load all the logs just specifying the main root. Other classical workarounds, like using softlink, are not available on GCS/fuse (but also on S3/fuse) as we cannot use softlinks in the buckets. So currently we don't have any solution to restrict a log selection.

bhack avatar Mar 13 '24 20:03 bhack

Also for the nature of the GCS fuse we are constrained to use --detect_file_replacement=true that currently in TB it is incompatible with --load_fast=true and it will be very hard to rely on a single large parent log path without load_fast.

bhack avatar Mar 13 '24 20:03 bhack

I do not believe the team currently has bandwidth to add --detect_file_replacement support to rustboard, but feel free to open another issue so that we can track +1s to help with prioritization.

Unfortunately usage of logdir_spec is discouraged in favor of using symlinks. We rely on external providers for most filesystem/blob storage systems (e.g. tf.io.gfile, fsspec, s3fs), so most specific needs such as this are only supported on a best-effort basis.

groszewn avatar Mar 14 '24 19:03 groszewn

It is really going to be a problem to use symlinks as it is not supported in GCS fuse and also GCS fuse is going to write substituting file in the gcs bucket.

As we are trying to support an official Google cloud product with the TB can you investigate internally as it seems we don't have any other workaround.

bhack avatar Mar 14 '24 21:03 bhack

Also this ticket it is still opened with this last comment asking for feedback: https://github.com/tensorflow/tensorboard/issues/349#issuecomment-1058728267

Do we need to open a new one?

bhack avatar Mar 15 '24 01:03 bhack

Yes, please open a new ticket with your feature request (which seems to be support for --detect_file_replacement in rustboard IIUC). As stated in the issue you linked, --detect_file_replacement is an experimental feature with best-effort support, but we can gauge interest via +1s on the feature request issue to help with prioritization.

groszewn avatar Mar 15 '24 18:03 groszewn

Is TB still actively developed? Or is it mainly in maintenance mode? I a relatively low commit/PR activity.

bhack avatar Mar 15 '24 18:03 bhack

In any case, The ticket is here https://github.com/tensorflow/tensorboard/issues/6790.

Unfortunately usage of logdir_spec is discouraged in favor of using symlinks.

I still don't understand what filtering option we have on fs that don't support symlinks like Google gcs-csi-fuse

bhack avatar Mar 15 '24 19:03 bhack