tensorboard
tensorboard copied to clipboard
Restarting training job doesn't visualize earlier log files
I have a training job on a spot instance and a TB log on GCS-CSI fuse.
As it is a spot instance when the job it is going to reastart it going to create a new log file.
If I don't restart the TB the curves are ok but when I restart the TB service it going to read only then last log file.
Is there a ways to workaround this?
How are you running TensorBoard in this situation? Are the log files created in the same directory? Are all steps recomputed in the new file, or are only newer steps added?
Yes the log are created in the same directory. So I have the first log file and then a second log file that we the job restarting after the Google spot preemption. If I don't restart the TB all it is ok but when I restart it I see only the 2nd log. To avoid point overlapping in the curves I use purge_step using the last available checkpoint step.
Have you tried specifying --reload_multifile?
Yes I am using both --reload_multifile=true
and --detect_file_replacement=true
for the gcs csi fuse nature
Can you please provide the output of tensorboard --inspect --logdir <your log directory>
as well as the result of diagnose_tensorboard.py
as specified in the issue template?
tensorboard --inspect --logdir
With that command over the log directory where I have found the issue (there are 4 logs file there of the same run)
======================================================================
Processing event files... (this can take a few minutes)
======================================================================
Traceback (most recent call last):
File "/usr/local/bin/tensorboard", line 8, in <module>
sys.exit(run_main())
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/main.py", line 41, in run_main
app.run(tensorboard.main, flags_parser=tensorboard.configure)
File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 278, in main
return runner(self.flags) or 0
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 291, in _run_serve_subcommand
efi.inspect(flags.logdir, event_file, flags.tag)
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 451, in inspect
inspection_units = get_inspection_units(logdir, event_file, tag)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 405, in get_inspection_units
field_to_obs=get_field_to_observations_map(generator, tag),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 212, in get_field_to_observations_map
for event in generator:
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 244, in Load
for record in super().Load():
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 178, in Load
yield next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /logs/my_job10/log/tensorboard/events.out.tfevents.1710242212.mymodel.74.0; No such file or directory
And diagnose_tensorboard.py
//diagnose_tensorboard.py:32: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
import pipes
### Diagnostics
<details>
<summary>Diagnostics output</summary>
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=11, micro=0, releaselevel='candidate', serial=1)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='tensorboard-2342343-4324', release='6.1.58+', version='#1 SMP PREEMPT_DYNAMIC Mon Jan 29 15:19:25 UTC 2024', machine='x86_64')
INFO: sys.getwindowsversion(): N/A
--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
INFO: installed: tensorboard==2.16.2
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.16.2'
--- check: tensorflow_python_version
2024-03-12 17:52:11.636867: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: tensorflow.__version__: '2.16.1'
INFO: tensorflow.__git_version__: 'v2.16.1-0-g5bc9d26649c'
--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.11/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'
--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'
--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]
--- check: readable_fqdn
INFO: socket.getfqdn(): 'tensorboard-5f94958cbc-w59k5'
--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=1590, st_dev=3145768, st_nlink=2, st_uid=0, st_gid=0, st_size=4096, st_atime=1710242647, st_mtime=1710242639, st_ctime=1710242639)
INFO: mode: 0o40777
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.11/dist-packages']; bad_roots (0): []
--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.1.0
astunparse==1.6.3
blinker==1.4
certifi==2024.2.2
charset-normalizer==3.3.2
cryptography==3.4.8
dbus-python==1.2.18
distro==1.7.0
dm-tree==0.1.8
flatbuffers==23.5.26
gast==0.5.4
google-pasta==0.2.0
grpcio==1.62.0
h5py==3.10.0
httplib2==0.20.2
idna==3.6
importlib-metadata==4.6.4
jeepney==0.7.1
keras==3.0.5
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
libclang==16.0.6
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
ml-dtypes==0.3.2
more-itertools==8.10.0
namex==0.0.7
numpy==1.26.4
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.2
pip==24.0
protobuf==4.25.3
Pygments==2.17.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
python-apt==2.4.0+ubuntu3
requests==2.31.0
rich==13.7.0
SecretStorage==3.3.1
setuptools==69.1.1
six==1.16.0
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow-cpu==2.16.1
tensorflow-io-gcs-filesystem==0.36.0
termcolor==2.4.0
typing_extensions==4.9.0
urllib3==2.2.1
wadllib==1.3.6
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
zipp==1.0.0
</details>
The output of tensorboard --inspect --logdir
looks like you actually passed it tensorboard --inspect --event_file <file>
instead. There should not be an error, I would expect either No event files found within logdir <your logdir>
or
Found event files in:
<logdir>/<run>/...
<logdir>/<another run>/...
...
These tags are in <logdir>...
...
followed by event statistics.
Can you please attempt a rerun of tensorboard --inspect --logdir <your log directory>
pointed at your log directory and resend? Otherwise, I'm inclined to suspect that there has been some corruption of your log directory.
Ok this is on one of the runs where I found the issue (this one has 4 logs files in the run):
2024-03-12 19:30:25.911301: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
======================================================================
Processing event files... (this can take a few minutes)
======================================================================
Found event files in:
/logs/<my-run>/log/tensorboard
These tags are in /logs/<myrun>/log/tensorboard:
audio -
histograms -
images
Image1
scalars
LR
tensor -
======================================================================
Event statistics for /logs/<myrun>/log/tensorboard:
audio -
graph -
histograms -
images
first_step 0
last_step 21050
max_step 21050
min_step 0
num_steps 422
outoforder_steps [(5600, 5000), (12450, 12000), (12200, 12000)]
scalars
first_step 0
last_step 21050
max_step 21050
min_step 0
num_steps 422
outoforder_steps [(5600, 5000), (12450, 12000), (12200, 12000)]
sessionlog:checkpoint -
sessionlog:start
outoforder_steps []
steps [0, 5000, 12000, 12000]
sessionlog:stop -
tensor -
======================================================================
If you see instead scalars and images are starting from the last log (5k) after TB restart
P.s. I've renamed/removed few things for anonymization.
Would you mind running with --purge_orphaned_data=False
set and see if that helps? It doesn't seem like your situation is the same, but this may help us to debug (doc link).
--purge_orphaned_data=False
It doesn't solve it seems to me that it is always restarting the scalar graphs and the images from the last log file only. When I don't restart the service all it is ok on the web frontend.
Are you able to copy the log directory locally and confirm that things work as expect? Would like to rule out whether this is something internal to TB or isolated to just reading from GCS.
I've copied the same folder locally and it seems to work correctly. What I am using on Google cloud is https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver
Do things work as expected when pointing to GCS directly instead of using the locally mounted path?
Same bucket passing gcs://
it seems to work correctly
Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.
One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec
to see if that helps?
Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.
One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec
to see if that helps?
Ok I will try. In the meantime are you relying, with TB internals, on some features that conflict with fuse limitation expecting that that path is a standard POSIX filesystem path?
See: https://cloud.google.com/storage/docs/gcs-fuse?hl=en#differences-and-limitations
TB itself (when using the python filesystem readers) relies on TensorFlow's filesystem API via tf.io.gfile
when TF is installed (and falls back to a stub implementation if TF is not available which does not have as much support), see #5286.
When using RustBoard for file reading (which is the default, you can try disabling with --load_fast=False
to see if the python reader itself works in this situation), it looks like we use the Rust standard library path
and walkdir (rustboard disk logdir source). I am not aware of any limitations of those implementations.
@songjiaxun Any opinion about this on the CSI gcsfuse POSIX limitations?
@groszewn I've isolated the bug. It is going to happen when I use --logdir_spec
with multiple logs directory.
The first one in the list is always loaded correctly after the TB restart. All the other runs are going to load only the last log file. I've verified this removing one by one:
- as the log it is going to become the first one in the list it is loaded completely the other ones just the last log.
Now the problem is that on the bucket it is common to have a lot of subfolders/runs with TB logs files and I cannot load all the logs just specifying the main root. Other classical workarounds, like using softlink, are not available on GCS/fuse (but also on S3/fuse) as we cannot use softlinks in the buckets. So currently we don't have any solution to restrict a log selection.
Also for the nature of the GCS fuse we are constrained to use --detect_file_replacement=true
that currently in TB it is incompatible with --load_fast=true
and it will be very hard to rely on a single large parent log path without load_fast
.
I do not believe the team currently has bandwidth to add --detect_file_replacement
support to rustboard, but feel free to open another issue so that we can track +1s to help with prioritization.
Unfortunately usage of logdir_spec
is discouraged in favor of using symlinks. We rely on external providers for most filesystem/blob storage systems (e.g. tf.io.gfile
, fsspec
, s3fs
), so most specific needs such as this are only supported on a best-effort basis.
It is really going to be a problem to use symlinks as it is not supported in GCS fuse and also GCS fuse is going to write substituting file in the gcs bucket.
As we are trying to support an official Google cloud product with the TB can you investigate internally as it seems we don't have any other workaround.
Also this ticket it is still opened with this last comment asking for feedback: https://github.com/tensorflow/tensorboard/issues/349#issuecomment-1058728267
Do we need to open a new one?
Yes, please open a new ticket with your feature request (which seems to be support for --detect_file_replacement
in rustboard IIUC). As stated in the issue you linked, --detect_file_replacement
is an experimental feature with best-effort support, but we can gauge interest via +1s on the feature request issue to help with prioritization.
Is TB still actively developed? Or is it mainly in maintenance mode? I a relatively low commit/PR activity.
In any case, The ticket is here https://github.com/tensorflow/tensorboard/issues/6790.
Unfortunately usage of logdir_spec is discouraged in favor of using symlinks.
I still don't understand what filtering option we have on fs that don't support symlinks like Google gcs-csi-fuse