
tensorboard_data_server is not a valid manylinux2010 wheel

Open aphedges opened this issue 2 years ago • 7 comments

When I run TensorBoard, I get the following error: lib/python3.9/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version 'GLIBC_2.18' not found (required by lib/python3.9/site-packages/tensorboard_data_server/bin/server). (I removed irrelevant path parts and fixed a quote so that Markdown formatting works.) Although I have not encountered any other problems while using TensorBoard, the error message is concerning.

I am running on CentOS 7, and the most recent version of glibc on my system is GLIBC_2.17.[^1] This is consistent with PEP 599, which states that the manylinux2014 platform tag is based on CentOS 7 and supports up to GLIBC_2.17. According to PEP 571, manylinux2010 only supports up to GLIBC_2.12, which is what TensorFlow is compiled with.
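To make the version comparison concrete, here is a minimal sketch in Python. The glibc ceilings come from the PEPs cited above, and 2.18 comes from the error message. The function and constant names are hypothetical, chosen for illustration, and are not part of TensorBoard or pip.

```python
import os
import platform

# glibc ceilings per platform tag, as cited above (PEP 571 and PEP 599).
MANYLINUX_GLIBC_CEILING = {
    "manylinux2010": (2, 12),
    "manylinux2014": (2, 17),
}


def parse_glibc_version(text):
    """Turn 'glibc 2.17', 'GLIBC_2.18', or '2.12' into a tuple like (2, 17)."""
    digits = text.replace("GLIBC_", "").replace("glibc", "").strip()
    return tuple(int(part) for part in digits.split("."))


def tags_satisfied(required):
    """Return the tags whose glibc ceiling covers the required version."""
    return [
        tag
        for tag, ceiling in MANYLINUX_GLIBC_CEILING.items()
        if required <= ceiling
    ]


def detect_glibc_version():
    """Return the running glibc version as a tuple, or None if unknown."""
    try:
        # On glibc systems this returns a string such as 'glibc 2.17'.
        return parse_glibc_version(os.confstr("CS_GNU_LIBC_VERSION"))
    except (AttributeError, ValueError, OSError, TypeError):
        pass  # Not a glibc system, or os.confstr is unavailable.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        try:
            return parse_glibc_version(version)
        except ValueError:
            return None
    return None


# The binary asks for GLIBC_2.18, which exceeds both ceilings.
required = parse_glibc_version("GLIBC_2.18")
print("tags that allow this requirement:", tags_satisfied(required))
print("glibc detected on this machine:", detect_glibc_version())
```

A binary that requires 2.18 fits neither tag, which is why it fails on a CentOS 7 system with glibc 2.17.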

Given that you said you target manylinux2010 because TensorFlow does, I searched the TensorFlow repository. The build process is complex and spread across many files, but it seems to specifically download an older version of glibc just to create wheels. For example, see tensorflow/tools/ci_build/Dockerfile.rbe.cuda11.1-cudnn8-ubuntu18.04-manylinux2010-multipython and tensorflow/tools/ci_build/devtoolset/build_devtoolset.sh.

I don't know if modifying the build process is extremely important (I can't seem to find any other glibc-related bugs in the issue tracker), but hopefully the scripts in the main TensorFlow repository can help fix the workflow if you deem it worth doing.

[^1]: I found it by running strings /lib64/libc.so.6 | grep '^GLIBC_' | sort --sort=version | uniq, which is based on https://gist.github.com/michaelchughes/85287f1c6f6440c060c3d86b4e7d764b#check-the-old-location-of-libcso6.
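For systems without the strings utility, the same check can be approximated in Python. This is a rough sketch, not an exact equivalent of the pipeline above: it only collects numeric GLIBC_x.y tags, so entries such as GLIBC_PRIVATE are skipped. The function name is hypothetical.

```python
import re

# Matches version tags such as GLIBC_2.17 or GLIBC_2.2.5 in raw bytes.
_GLIBC_TAG = re.compile(rb"GLIBC_(\d+(?:\.\d+)+)")


def glibc_symbol_versions(data):
    """Return the unique GLIBC version tags found in `data`, oldest first."""
    found = {m.group(1).decode("ascii") for m in _GLIBC_TAG.finditer(data)}
    # Sort numerically so that 2.17 comes after 2.3, like `sort --sort=version`.
    return sorted(found, key=lambda v: tuple(int(p) for p in v.split(".")))


# Usage, assuming the library lives at the path used in the command above:
#
#     with open("/lib64/libc.so.6", "rb") as handle:
#         print(glibc_symbol_versions(handle.read())[-1])
```

The last element of the returned list is the newest symbol version that the library provides.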

Originally posted by @aphedges in https://github.com/tensorflow/tensorboard/issues/4928#issuecomment-1032227950

This got no response in a closed issue, so I figured I'd make a new one.

aphedges avatar Sep 24 '22 02:09 aphedges

Thanks for filing this issue. I am going to take some time to look into it this week. To be clear, you get this error message, but TensorBoard is running just fine, correct?

JamesHollyer avatar Sep 27 '22 20:09 JamesHollyer

As far as I can tell, TensorBoard is working without any issues. However, I don't have a lot of experience with TensorBoard and have spent even less time using it on other computers, so it's difficult for me to tell. It's quite possible I just haven't encountered a broken feature yet.

aphedges avatar Sep 27 '22 20:09 aphedges

Could you share any other log messages that TensorBoard prints when you run it with --verbosity=0? I wonder if the glibc error just means the Rust-based data server isn't launching and you're falling back to the Python logic. In that case everything should still work; you just won't get the enhanced performance of the data server.

nfelt avatar Sep 28 '22 15:09 nfelt

Sure! Here is the initial output:

$ tensorboard --verbosity=0 --logdir . --bind_all
TensorFlow installation not found - running with reduced feature set.
I0928 11:33:21.172642 140398046992192 server_ingester.py:290] Server binary (from Python package v0.6.1): /nas/home/ahedges/.pyenv/versions/3.10.6/envs/peko/lib/python3.10/site-packages/tensorboard_data_server/bin/server
I0928 11:33:21.173444 140398046992192 server_ingester.py:138] Spawning data server: ['/nas/home/ahedges/.pyenv/versions/3.10.6/envs/peko/lib/python3.10/site-packages/tensorboard_data_server/bin/server', '--logdir=.', '--reload=5', '--samples-per-plugin=', '--port=0', '--port-file=/tmp/tensorboard_data_server_12re8kvb/port', '--die-after-stdin', '--error-file=/tmp/tensorboard_data_server_12re8kvb/startup_error', '--verbose']
I0928 11:33:21.221188 140398046992192 server_ingester.py:160] Polling for data server port (attempt 0)
I0928 11:33:21.221515 140398046992192 server_ingester.py:162] Port file contents: None
/nas/home/ahedges/.pyenv/versions/3.10.6/envs/peko/lib/python3.10/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /nas/home/ahedges/.pyenv/versions/3.10.6/envs/peko/lib/python3.10/site-packages/tensorboard_data_server/bin/server)
I0928 11:33:21.722353 140398046992192 program.py:444] Data server error: exited with 1; check stderr for details; falling back to multiplexer
I0928 11:33:21.722964 140398046992192 plugin_event_multiplexer.py:106] Event Multiplexer initializing.
I0928 11:33:21.723106 140398046992192 plugin_event_multiplexer.py:126] Event Multiplexer done initializing
I0928 11:33:21.723209 140398046992192 data_ingester.py:128] Launching reload in a daemon thread
I0928 11:33:21.723505 140397729126144 data_ingester.py:102] TensorBoard reload process beginning
I0928 11:33:21.723621 140397729126144 plugin_event_multiplexer.py:203] Starting AddRunsFromDirectory: .
I0928 11:33:21.723885 140397729126144 io_wrapper.py:215] GetLogdirSubdirectories: Starting to list directories via walking.
I0928 11:33:21.746901 140397729126144 plugin_event_multiplexer.py:205] Adding run from directory ./log_isAfter_2022_09_24_02_56_42/test
I0928 11:33:21.747088 140397729126144 plugin_event_multiplexer.py:160] Constructing EventAccumulator for ./log_isAfter_2022_09_24_02_56_42/test
I0928 11:33:21.751899 140397729126144 plugin_event_multiplexer.py:205] Adding run from directory ./log_2022_09_23_00_04_53/test
I0928 11:33:21.752059 140397729126144 plugin_event_multiplexer.py:160] Constructing EventAccumulator for ./log_2022_09_23_00_04_53/test
I0928 11:33:21.756797 140397729126144 plugin_event_multiplexer.py:205] Adding run from directory ./log_HasSubEvent_2022_09_24_02_56_37/test
I0928 11:33:21.756966 140397729126144 plugin_event_multiplexer.py:160] Constructing EventAccumulator for ./log_HasSubEvent_2022_09_24_02_56_37/test
I0928 11:33:21.761711 140397729126144 plugin_event_multiplexer.py:205] Adding run from directory ./log_isBefore_2022_09_24_02_56_45/test
I0928 11:33:21.761876 140397729126144 plugin_event_multiplexer.py:160] Constructing EventAccumulator for ./log_isBefore_2022_09_24_02_56_45/test
I0928 11:33:21.762956 140397729126144 plugin_event_multiplexer.py:209] Done with AddRunsFromDirectory: .
I0928 11:33:21.763048 140397729126144 data_ingester.py:105] TensorBoard reload process: Reload the whole Multiplexer
I0928 11:33:21.763092 140397729126144 plugin_event_multiplexer.py:214] Beginning EventMultiplexer.Reload()
I0928 11:33:21.763177 140397729126144 plugin_event_multiplexer.py:257] Reloading runs serially (one after another) on the main thread.
I0928 11:33:22.577528 140397729126144 directory_watcher.py:123] No path found after ./log_isAfter_2022_09_24_02_56_42/test/events.out.tfevents.1663988208.sagalg11
TensorBoard 2.10.0 at http://piranha-sub-01.isi.edu:6006/ (Press CTRL+C to quit)
I0928 11:33:22.875307 140397729126144 directory_watcher.py:123] No path found after ./log_2022_09_23_00_04_53/test/events.out.tfevents.1663891500.sagalg11
I0928 11:33:24.005398 140397729126144 directory_watcher.py:123] No path found after ./log_HasSubEvent_2022_09_24_02_56_37/test/events.out.tfevents.1663988208.sagalg11
I0928 11:33:24.238435 140397729126144 directory_watcher.py:123] No path found after ./log_isBefore_2022_09_24_02_56_45/test/events.out.tfevents.1663988210.sagalg11
I0928 11:33:24.238598 140397729126144 plugin_event_multiplexer.py:267] Finished with EventMultiplexer.Reload()
I0928 11:33:24.238654 140397729126144 data_ingester.py:110] TensorBoard done reloading. Load took 2.515 secs

After this, the TensorBoard reload runs repeatedly, so I didn't include the later output.

aphedges avatar Sep 28 '22 18:09 aphedges

Thanks! This line shows that it is doing what I suspected (falling back to Python):

I0928 11:33:21.722353 140398046992192 program.py:444] Data server error: exited with 1; check stderr for details; falling back to multiplexer

So this shouldn't cause problems for your use of TensorBoard. I agree that ideally we would either adjust the data server build process to work with the older glibc or make the package handle this failure more gracefully (e.g. swallow the error rather than printing it). Realistically, though, it probably won't be a very high priority for us, since I don't think there is a functional issue, and the data server is known to support only a limited number of platforms.
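For anyone who wants to check their own logs for this fallback, here is a minimal sketch in Python. It assumes the message wording matches the line quoted above from TensorBoard 2.10.0; other versions may phrase it differently. The function name is hypothetical.

```python
import re

# Pattern based on the log line quoted above; the wording is an assumption
# for other TensorBoard versions.
FALLBACK_RE = re.compile(
    r"Data server error: (?P<reason>.*?); falling back to multiplexer"
)


def data_server_fallback_reason(log_text):
    """Return the reported failure reason, or None if no fallback was logged."""
    for line in log_text.splitlines():
        match = FALLBACK_RE.search(line)
        if match:
            return match.group("reason")
    return None


sample = (
    "I0928 11:33:21.722353 140398046992192 program.py:444] "
    "Data server error: exited with 1; check stderr for details; "
    "falling back to multiplexer"
)
print(data_server_fallback_reason(sample))
```

A non-None result means the Rust data server did not start and TensorBoard is serving data through the Python multiplexer instead.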

nfelt avatar Sep 28 '22 18:09 nfelt

That's very good to know! Thanks for investigating.

aphedges avatar Sep 28 '22 18:09 aphedges

Building tensorboard_data_server for aarch64 (with Spack) fails with the following error message. I think it's because manylinux2010 doesn't support aarch64. Could you fix this in the near future?

ERROR: tensorboard_data_server-0.6.1-py3-none-manylinux2010_aarch64.whl is not a supported wheel on this platform.
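To illustrate why this tag looks wrong, here is a minimal sketch in Python that reads the platform tag from a wheel filename and compares its architecture against the two that PEP 571 defines for manylinux2010 (x86_64 and i686). The helper names are hypothetical, and this is not how pip itself decides whether a wheel is supported.

```python
# Architectures that PEP 571 defines for the manylinux2010 tag.
MANYLINUX2010_ARCHES = {"x86_64", "i686"}


def wheel_platform_tags(filename):
    """Return the platform tags from a wheel filename.

    Wheel names follow name-version[-build]-python-abi-platform.whl, and the
    platform field may hold several tags separated by dots.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return parts[-1].split(".")


def invalid_manylinux2010_tags(filename):
    """Return manylinux2010 tags whose architecture PEP 571 does not define."""
    prefix = "manylinux2010_"
    return [
        tag
        for tag in wheel_platform_tags(filename)
        if tag.startswith(prefix)
        and tag[len(prefix):] not in MANYLINUX2010_ARCHES
    ]


name = "tensorboard_data_server-0.6.1-py3-none-manylinux2010_aarch64.whl"
print(invalid_manylinux2010_tags(name))
```

For the filename in the error message, the sketch flags manylinux2010_aarch64, which matches the guess that manylinux2010 does not cover aarch64.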

h-murai avatar Jul 03 '23 02:07 h-murai