
Fast data loading feedback (`--load_fast=true`; “RustBoard”)

Open wchargin opened this issue 4 years ago • 55 comments

This thread is for tracking feedback about TensorBoard’s experimental mode for fast data loading. Typical speedups range from 100× to 400×.

Who should try this: Anyone who’s found TensorBoard’s data loading to be slower than they’d like.

Who shouldn’t try this: Windows users (for now).

Feedback: Feedback form, or reply on this thread.

Try it out

To try this out, please uninstall all copies of TensorBoard and then install the latest version of tb-nightly:

pip uninstall -y tensorboard tb-nightly &&
pip install tb-nightly  # must have at least tb-nightly==2.5.0a20210316

Then, invoke TensorBoard with the --load_fast=true flag:

tensorboard --logdir /path/to/logs --load_fast true

Use TensorBoard as you usually would. It should work the same way, just faster.
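For scripted setups, the same invocation can be assembled from Python. This is a minimal sketch, assuming the `tensorboard` executable is on your PATH; `build_command` and `launch` are hypothetical helper names for illustration, not part of TensorBoard.

```python
import shutil
import subprocess


def build_command(logdir, load_fast="true"):
    """Return the argv list for the invocation shown above."""
    # "true", "false", and "auto" are the values mentioned in this thread.
    if load_fast not in ("true", "false", "auto"):
        raise ValueError(f"unexpected --load_fast value: {load_fast!r}")
    return ["tensorboard", "--logdir", logdir, f"--load_fast={load_fast}"]


def launch(logdir, load_fast="true"):
    """Run TensorBoard in the foreground; blocks until it exits."""
    cmd = build_command(logdir, load_fast)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("tensorboard executable not found on PATH")
    return subprocess.run(cmd, check=False).returncode


print(" ".join(build_command("/path/to/logs")))
```

Keeping the argv construction separate from the launch makes it easy to log or inspect the exact command before running it.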

Feedback

You can respond to this anonymous Google Form, or reply on this thread, or open a new issue. Let us know: did it work? how much faster was it? any suggestions or requests?

Known issues

We know about these, but please let us know if they matter for you, so that we can prioritize working on them:

  • Windows is not supported out of the box.
  • Some third-party plugins may need to be updated to work with this mode (e.g., the profile plugin).

FAQ

What does “data loading” include?

It includes time spent reading files in your logdir. It does not include time spent painting charts on the frontend.

What is the --load_fast flag?

Pass --load_fast=true to tell TensorBoard to use a new data loading mechanism, which is generally hundreds of times faster.

Is --load_fast=true right for me?

Currently, this mode is supported on Linux and macOS. If you are interested in using it on other platforms, ping @wchargin and I’ll show you how to build it.

Most features of TensorBoard are expected to work with the new data loading mechanism. All standard TensorBoard dashboards (scalars, images, etc.) should work, and flags like --reload_interval should work, too. You can use logdirs on local disk or on GCS buckets (public or private).

Do I need to have TensorFlow installed?

No.

What’s happening under the hood?

Instead of crawling your logdir in a mixture of Python and C++ code with a lot of locking, cross-language marshalling, and slow data manipulation in Python, we read the data in a dedicated subprocess. This program is written in Rust and is optimized for concurrent reading and serving. More design details here.
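The verbose logs posted later in this thread suggest how the Python process finds the subprocess: the data server is started with `--port=0` and `--port-file=...`, and the Python side polls that file until a port appears. The sketch below illustrates that kind of handshake in general terms. It is not TensorBoard's actual code; `wait_for_port` is a hypothetical name, and waiting for a trailing newline (to avoid reading a half-written file) is an assumption of this sketch.

```python
import os
import tempfile
import time


def wait_for_port(port_file, attempts=20, delay=0.5):
    """Poll `port_file` until it holds a complete port number; return it as int."""
    for _ in range(attempts):
        try:
            with open(port_file) as f:
                contents = f.read()
        except FileNotFoundError:
            contents = None  # the server has not written the file yet
        if contents and contents.endswith("\n"):
            return int(contents)
        time.sleep(delay)
    raise TimeoutError(f"no port written to {port_file}")


# Demo: simulate a server that has already written its port.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "port")
    with open(path, "w") as f:
        f.write("36186\n")
    print("data server port:", wait_for_port(path))
```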

wchargin avatar Mar 16 '21 20:03 wchargin

Hello!

Very much interested in this, as we currently maintain a custom entrypoint to make Tensorboard work at all with our data sizes. Unfortunately, I can't get this to work anywhere. Using the latest nightly docker image I get the following error:

root@15bc33cc211f:/# tensorboard --logdir foobar --load_fast=true
Error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
Traceback (most recent call last):
  File "/usr/local/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/main.py", line 46, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 267, in main
    return runner(self.flags) or 0
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 283, in _run_serve_subcommand
    server = self._make_server()
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 433, in _make_server
    (data_provider, deprecated_multiplexer) = self._make_data_provider()
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 425, in _make_data_provider
    ingester.start()
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/data/server_ingester.py", line 150, in start
    % popen.poll()
RuntimeError: Data server exited with 1; check stderr for details

Presumably it tries to bind some port that's already in use by another process; unfortunately it doesn't say which one.

Also, it doesn't seem to work with logdir_spec, only logdir. This isn't a huge pain, but the error message just states that I didn't pass logdir -- it should probably explicitly state that load_fast and logdir_spec are incompatible.

tgolsson avatar Mar 19 '21 10:03 tgolsson

@tgolsson: Hi; thank you for your feedback! I hadn’t looked into Docker at all. We bind to port 0, which requests an arbitrary free port from the OS, so it looks like it’s not a port issue but an IPv6 host issue. I’ve filed #4801 and will take a look. I’ve posted therein what I think should be a workaround, in case you’re interested in that sort of thing.
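To illustrate the port 0 behavior described here, the following Python sketch binds a socket to port 0 and reads back the port the OS assigned. It also probes whether the IPv6 loopback address can be bound at all, since code 99 in the traceback above is Linux's EADDRNOTAVAIL. The helper names are hypothetical, and this is only an illustration of the mechanism, not the data server's code.

```python
import socket


def bind_free_port(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, assigned_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: let the OS choose any free port
    return sock, sock.getsockname()[1]


def can_bind_ipv6_loopback():
    """Return True if a socket can be bound to ::1 on this host."""
    if not socket.has_ipv6:
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as s:
            s.bind(("::1", 0))
        return True
    except OSError:  # e.g. EADDRNOTAVAIL when IPv6 is disabled
        return False


sock, port = bind_free_port()
print("OS assigned port", port)
print("IPv6 loopback bindable:", can_bind_ipv6_loopback())
sock.close()
```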

edit: Fixed in #4804; confirmed fix in Docker nightlies.

Also, it doesn't seem to work with logdir_spec, only logdir. This isn't a huge pain, but the error message just states that I didn't pass logdir -- it should probably explicitly state that load_fast and logdir_spec are incompatible.

Yep. As of #4794, if you use --load_fast=auto, we’ll automatically detect unsupported invocations (including --logdir_spec) and fall back to the old codepaths. I can also try to make the error more explicit particularly for --logdir_spec. Filed #4802.

This is super helpful feedback; thank you.

wchargin avatar Mar 19 '21 17:03 wchargin

With tensorboard-plugin-profile (2.4.0) installed, I'm getting errors in the log:

Exception in thread DynamicProfilePluginIsActiveThread:
Traceback (most recent call last):
  File "/Users/till/homebrew2/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/till/homebrew2/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/till/tfnightly-py3.8/lib/python3.8/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 311, in compute_is_active
    self._is_active = any(self.generate_run_to_tools())
  File "/Users/till/tfnightly-py3.8/lib/python3.8/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 693, in generate_run_to_tools
    plugin_assets = self.multiplexer.PluginAssets(PLUGIN_NAME)
AttributeError: 'NoneType' object has no attribute 'PluginAssets'

(They disappear with --load_fast=false)

brychcy avatar Apr 08 '21 07:04 brychcy

Hi @brychcy—thanks! Yes, this is true. The profile plugin uses non-standard approaches to load its data and so won’t work out of the box with --load_fast. I’ll see if we can get it to work, but in the meantime you’ll have to either pass --load_fast=false (if you want to use the profile plugin) or uninstall the profile plugin package (if you don’t care about it and want to silence the errors).

Added a note to the “Known issues” section; thank you!

wchargin avatar Apr 08 '21 16:04 wchargin

@brychcy: I’ve sent the profiler folks a patch: https://github.com/tensorflow/profiler/issues/298

Their build appears to be pretty broken, so I’m not sure how long it will take them to integrate this and push a release.

wchargin avatar Apr 08 '21 17:04 wchargin

@wchargin Not quite feedback, but I'm wondering if there are any thoughts on multi-directory RustBoard (--logdir dir_a,dir_b in the old syntax)? I started doing the work but figured I might ask in case it was intentionally removed or there's a WIP somewhere I'm not seeing.

tgolsson avatar Apr 29 '21 18:04 tgolsson

@tgolsson: Good question! I was thinking of instead supporting a more general mechanism that also resolves requests like #1708. Imagine something like:

$ tensorboard daemon start
$ tensorboard daemon add dir_a
$ tensorboard --daemon --bind_all
$ tensorboard daemon add dir_b

That is, you could add or remove log directories at runtime without having to relaunch TensorBoard or discard existing loading progress, and also in a way that naturally supports remote filesystems and doesn't require setting up symlink trees.

Opened #4923 to track this, and would be happy to hear your thoughts.

wchargin avatar Apr 29 '21 22:04 wchargin

I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?

[2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }

I don't have that many runs (~2000), so it shouldn't really be an issue. Using lsof to count the number of open FDs shows over 12k being used...

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
   6210 tokio-run
   6210 Reloader-
   1035 StdinWatc
   1035 server
   1035 Reloader
    184 gmain
    168 gdbus
    134 grpc_glob
     85 bash
     80 snapd

Compared to <500 in "slow" mode.

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
    427 tensorboa
    184 gmain
    168 gdbus
     85 bash
     80 snapd
     72 systemd
     71 screen
     52 dconf\x20
     51 dbus-daem
     48 llvmpipe-

In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.
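On Linux and macOS, the per-process cap on open file descriptors is the RLIMIT_NOFILE resource limit, and child processes inherit it from their parent. The sketch below checks the soft limit and tries to raise it toward the hard limit before anything is spawned. It uses only Python's standard `resource` module (Unix only); `raise_soft_nofile` is a hypothetical helper, the value 16384 is an arbitrary illustration, and whether a higher limit resolves the warnings above is untested here.

```python
import resource


def raise_soft_nofile(wanted):
    """Try to raise the soft open-file limit to `wanted`; return the limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough
    # The soft limit may not exceed the hard limit.
    new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        pass  # the OS refused; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print("soft open-file limit now:", raise_soft_nofile(16384))
```

The shell equivalent is to run `ulimit -n` with a larger value in the same shell before launching TensorBoard.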

Raphtor avatar May 11 '21 14:05 Raphtor

@Raphtor: interesting, thank you! Both the old and new codepaths keep an open fd for each event file, so I had considered this but expected it not to be a big problem. Let’s follow up in #4955.

wchargin avatar May 11 '21 16:05 wchargin

Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.

It works fine if I set --load_fast=false.

sjincho avatar Jun 26 '21 04:06 sjincho

Fast data loading may be causing issues with the profiler https://github.com/tensorflow/profiler/issues/344 (one of several issues mentioning this problem recently) - a possible solution for now is to switch it off with %tensorboard --logdir=logs --load_fast=false cc @Terranlee @Jimicy @yisitu

8bitmp3 avatar Jul 22 '21 14:07 8bitmp3

Update: try the latest Profiler plugin, v2.5: pip install tensorboard_plugin_profile (or pin tensorboard_plugin_profile==2.5.0). Then launch TensorBoard (e.g. %tensorboard --logdir=logs, without the --load_fast switch) and select Profiler. Thanks @yisitu 👍
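A quick way to tell whether an environment still needs the --load_fast=false workaround is to check the installed profile plugin version. The sketch below relies only on what this thread reports (errors with 2.4.0, a fix in 2.5.0) and on Python's standard `importlib.metadata`; the helper names and the simplified version parsing are assumptions for illustration.

```python
from importlib import metadata


def parse_release(version):
    """Turn '2.4.0' into (2, 4, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def suggested_load_fast(dist="tensorboard-plugin-profile", fixed_in=(2, 5, 0)):
    """Return 'false' if an older profile plugin is installed, else None."""
    try:
        installed = parse_release(metadata.version(dist))
    except metadata.PackageNotFoundError:
        return None  # plugin absent: nothing to work around
    return "false" if installed < fixed_in else None


print("suggested --load_fast override:", suggested_load_fast())
```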

8bitmp3 avatar Aug 02 '21 20:08 8bitmp3

You're welcome, happy to help!

yisitu avatar Aug 02 '21 22:08 yisitu

Anyone else landing here because they're following instructions from this link regarding using Tensorboard in AzureML?

jstremme avatar Sep 03 '21 15:09 jstremme

Closing, as the issue has been resolved now that I have released tensorboard_plugin_profile 2.5.0.

yisitu avatar Sep 03 '21 16:09 yisitu

Ah, we would like to keep this issue open to solicit more feedback on the feature. Reopening.

stephanwlee avatar Sep 03 '21 16:09 stephanwlee

I see, I'll assign it back to you to track the feature.

yisitu avatar Sep 03 '21 16:09 yisitu

Hi, thank you for building this awesome function!

Is it possible to restrict the data server so that it communicates with only one TensorBoard process? I would appreciate support for this: the current data server seems to be accessible to any user on a shared server, even though TensorBoard itself can be given a simple passcode by specifying --path_prefix.

yoshipon avatar Oct 08 '21 06:10 yoshipon

I have TensorFlow output data that I explore in TensorBoard as scalars. I usually use the RELATIVE mode of the Horizontal Axis, and the graphs are displayed well. But with the --load_fast true option, the graphs show the data as points (not varying along the X axis) instead of curves. The WALL mode shows only points as well. An example of my data is attached. train.zip

an-ivanov avatar Oct 09 '21 12:10 an-ivanov

Hi @tgolsson really curious about your implementation for large datasets as I'm trying to get tensorboard running for a few 100K. Have you just changed the hard coded limit (100K) in the typescript and rebuilt? What other changes have you made?

GeorgePearse avatar Nov 09 '21 16:11 GeorgePearse

I'm not sure what that limit is for, but I've never heard of it unfortunately. Our problem was related to having too much regular logging data (scalars, histograms/distributions, images) leading to an infinite queue of "refreshes" because they wouldn't finish before retries.

tgolsson avatar Nov 09 '21 17:11 tgolsson

Sorry @tgolsson, keep forgetting the number of other components to Tensorboard. My problems are specific to the embedding projector, but I guess that's not what you've had to solve. Thanks anyway!

GeorgePearse avatar Nov 09 '21 17:11 GeorgePearse

Hi, I have encountered an issue similar to #5116 when I use --load_fast=true (implicitly, by default). The tf events are stored on a shared file system. After a ReadRecordError, the affected runs stop updating with the latest events. When I use --load_fast=false, there are no problems apart from slow loading.

Jiayuan-Gu avatar Nov 11 '21 23:11 Jiayuan-Gu

Hi, on all compute clusters using our software stack, RustBoard hangs indefinitely at startup and has to be killed (with sigkill, sigterm isn't sufficient i.e. CTRL+C doesn't work). It does not reach the point where it prints something like TensorBoard x.y.z at http://0.0.0.0:PORT/ (Press CTRL+C to quit).

Here is a sample output using -v 1:

(env) [user01@login1 8]$ tensorboard --logdir ~/projects/def-sponsor00/$USER/out --host 0.0.0.0 --port 0 -v 1
TensorFlow installation not found - running with reduced feature set.
I0210 16:25:15.920281 139757372876608 server_ingester.py:290] Server binary (from Python package v0.6.1): /home/user01/env/lib/python3.8/site-packages/tensorboard_data_server/bin/server
I0210 16:25:15.922316 139757372876608 server_ingester.py:138] Spawning data server: ['/home/user01/env/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--logdir=/home/user01/projects/def-sponsor00/user01/out', '--reload=5', '--samples-per-plugin=', '--port=0', '--port-file=/tmp/tensorboard_data_server_rd4w992q/port', '--die-after-stdin', '--error-file=/tmp/tensorboard_data_server_rd4w992q/startup_error', '--verbose', '--verbose']
[2022-02-10T16:25:15Z DEBUG rustboard_core::cli] Parsed options: Opts { logdir: "/home/user01/projects/def-sponsor00/user01/out", host: "localhost", port: 0, reload: Loop { delay: 5s }, verbosity: 2, die_after_stdin: true, port_file: Some("/tmp/tensorboard_data_server_rd4w992q/port"), error_file: Some("/tmp/tensorboard_data_server_rd4w992q/startup_error"), checksum: false, no_checksum: false, samples_per_plugin: PluginSamplingHint({}) }
I0210 16:25:15.936901 139757372876608 server_ingester.py:160] Polling for data server port (attempt 0)
I0210 16:25:15.938199 139757372876608 server_ingester.py:162] Port file contents: None
[2022-02-10T16:25:15Z TRACE mio::poll] registering event source with poller: token=Token(0), interests=READABLE | WRITABLE
[2022-02-10T16:25:15Z INFO  rustboard_core::cli] Wrote port "36186" to /tmp/tensorboard_data_server_rd4w992q/port
[2022-02-10T16:25:15Z INFO  rustboard_core::cli] Starting load cycle
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Starting load for run "7/lightning_logs/version_7"
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Starting load for run "8/lightning_logs/version_8"
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Finished load for run "8/lightning_logs/version_8" (1.99877ms)
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Finished load for run "7/lightning_logs/version_7" (3.682665ms)
[2022-02-10T16:25:15Z INFO  rustboard_core::cli] Finished load cycle (8.249155ms)
I0210 16:25:16.439103 139757372876608 server_ingester.py:160] Polling for data server port (attempt 1)
I0210 16:25:16.439600 139757372876608 server_ingester.py:162] Port file contents: '36186\n'
[2022-02-10T16:25:20Z INFO  rustboard_core::cli] Starting load cycle
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Starting load for run "7/lightning_logs/version_7"
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Starting load for run "8/lightning_logs/version_8"
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Finished load for run "7/lightning_logs/version_7" (26.968µs)
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Finished load for run "8/lightning_logs/version_8" (22.896µs)
[2022-02-10T16:25:20Z INFO  rustboard_core::cli] Finished load cycle (6.097394ms)
< The last 6 lines repeat indefinitely >

I was wondering if we could use an environment variable to set load_fast=false by default on our clusters.
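Until something like this exists in TensorBoard itself, a site-wide wrapper script could approximate it. The sketch below adds a default `--load_fast` value taken from an environment variable when the user has not passed the flag. The variable name `TB_LOAD_FAST_DEFAULT` and the helper are hypothetical; TensorBoard does not read this variable.

```python
import os


def with_default_load_fast(argv, environ=os.environ, var="TB_LOAD_FAST_DEFAULT"):
    """Append --load_fast=<value of var> unless argv already sets the flag."""
    has_flag = any(
        arg == "--load_fast" or arg.startswith("--load_fast=") for arg in argv
    )
    default = environ.get(var)  # hypothetical variable, chosen for this sketch
    if has_flag or not default:
        return list(argv)
    return list(argv) + [f"--load_fast={default}"]


args = with_default_load_fast(
    ["--logdir", "out"], environ={"TB_LOAD_FAST_DEFAULT": "false"}
)
print(args)
```

A wrapper installed ahead of the real executable on PATH could pass the resulting arguments to `tensorboard` via `os.execvp`; an explicit `--load_fast` from the user still takes precedence.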

lemairecarl avatar Feb 10 '22 16:02 lemairecarl

I've had trouble with the --load_fast=true flag. When running tensorboard without setting --load_fast=false, I eventually start getting the following message repeated indefinitely (I've redacted directory names and usernames as XXX):

[2022-08-06T17:40:14Z WARN  rustboard_core::run] Failed to open event file EventFileBuf("XXX/20220806_040526/20220806_040526/events.out.tfevents.1659773299.XXX.XXX.XXX"): Os { code: 24, kind: Other, message: "Too many open files" }

When I get this message, Tensorboard fails to launch. However, I no longer get this message, and Tensorboard launches normally, if I pass --load_fast=false while launching Tensorboard.
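Since an earlier reply in this thread notes that both codepaths keep an open file descriptor for each event file, one diagnostic is to compare the number of event files in the logdir with the process's open-file limit. A minimal sketch, assuming event files can be recognized by "tfevents" in the file name (as in the warning above); `count_event_files` is a hypothetical helper and `resource` is Unix only.

```python
import os
import resource


def count_event_files(logdir):
    """Count files under `logdir` whose name contains 'tfevents'."""
    total = 0
    for _root, _dirs, files in os.walk(logdir):
        total += sum(1 for name in files if "tfevents" in name)
    return total


def report(logdir):
    """Print the event file count next to the soft open-file limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"{count_event_files(logdir)} event files; soft open-file limit: {soft}")


report(".")
```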

mhdadk avatar Aug 06 '22 17:08 mhdadk

I am having trouble with --load_fast on an old server. I don't have access to glibc 2.18 or above, so I have to use patchelf. After a clean install of tensorboard in Python 3.8, calling tensorboard --logdir . --load_fast=true gives:

/home/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /home/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server)

Then I patched server as follows,

patchelf --set-interpreter ~/scratch/mylib/glibc-2.18/lib/ld-linux-x86-64.so.2 --set-rpath ~/scratch/mylib/glibc-2.18/lib/ ~/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server

tensorboard won't complain about libc.so.6 not found anymore. Instead, it now gives me this,

TensorFlow installation not found - running with reduced feature set.
Could not start data server: failed to bind to ("localhost", 0): failed to lookup address information: Name or service not known.

Do you have any idea what could be going wrong? Thank you!
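"Name or service not known" is the message glibc reports when a host name cannot be resolved, so one thing worth narrowing down is whether `localhost` resolves with the system's regular resolver. The sketch below checks that from Python. Note that it goes through Python's own libc rather than the patched binary's, so it can only rule out a host-level configuration problem; `resolve` is a hypothetical helper name.

```python
import socket


def resolve(host="localhost", port=0):
    """Return the sorted distinct addresses `host` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # same class of failure as in the error above
    return sorted({info[4][0] for info in infos})


print("localhost resolves to:", resolve())
```

If this prints addresses while the patched server still fails, the lookup problem more likely lies with the patched glibc setup than with the host's configuration.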


Update: I am able to use it if I install TensorBoard with conda via conda install -c conda-forge tensorboard tensorboard-data-server, since conda handles all the dependencies for me. But it would be nice if the pip-installed one also worked.

drmeerkat avatar Aug 17 '22 07:08 drmeerkat

Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.

It works fine if I set --load_fast=false.

Can this be considered to be a bug? Is there workaround to use --load_fast=true under GKE?

Thank you!

Corwinpro avatar Sep 05 '22 13:09 Corwinpro

Hi! I think I have a vague understanding of how the original issue can be solved. If someone could help me a bit with the last push, I believe we should be able to use --load_fast=true under GKE.

Corwinpro avatar Sep 09 '22 15:09 Corwinpro

Hi Corwinpro, there was some discussion about this and your PR. There are a couple folks willing to help you shepherd this into the repo. I've created a new issue for this specific error here:

https://github.com/tensorflow/tensorboard/issues/5934

Thanks for your patience and your contribution!

bmd3k avatar Sep 20 '22 14:09 bmd3k

Authentication via the default service account is indeed not working when using logdir in 2.8.0; we had to run with --load_fast=false to get it to work. Are there any plans to support default service account credentials? Also, why was this experimental feature turned on by default?

samos123 avatar Aug 11 '23 04:08 samos123