
Tensorboard 2.9.1 --logdir as aws s3 path

Open Krasner opened this issue 1 year ago • 6 comments

I am using TensorBoard 2.9.1. When --logdir is set to s3://<bucket>/<folder>, TensorBoard is not able to read event files.

On my machine (an EC2 instance) I can reach that logdir via the AWS CLI (aws s3 ls s3://<bucket>/<folder>). In Python I can also read the files in that folder using tensorflow_io:

import tensorflow as tf
import tensorflow_io as tfio  # imported for its side effect: makes s3:// paths readable

data = tf.io.read_file("s3://<bucket>/<folder>/<file>")

This is the Tensorboard command:

AWS_REGION=us-east-1 S3_REGION=us-east-1 S3_ENDPOINT=s3.us-east-1.amazonaws.com S3_USE_HTTPS=1 S3_VERIFY_SSL=0 AWS_LOG_LEVEL=1 CUDA_VISIBLE_DEVICES="" tensorboard --logdir s3://<bucket>/<folder> --host 0.0.0.0
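
The same invocation can be assembled from Python, which keeps the S3 settings in one place. This is a minimal sketch that only builds the argument list and environment. `build_tensorboard_command` is a hypothetical helper name, the variable values are copied from the command above, and the bucket path is a placeholder.

```python
import os


def build_tensorboard_command(logdir, host="0.0.0.0", region="us-east-1"):
    """Return (argv, env) for launching TensorBoard against an S3 logdir.

    Hypothetical helper: the variable names and values mirror the shell
    command above and are not a definitive list of required settings.
    """
    env = dict(os.environ)  # start from the current environment
    env.update(
        {
            "AWS_REGION": region,
            "S3_REGION": region,
            "S3_ENDPOINT": f"s3.{region}.amazonaws.com",
            "S3_USE_HTTPS": "1",
            "S3_VERIFY_SSL": "0",
            "AWS_LOG_LEVEL": "1",
            "CUDA_VISIBLE_DEVICES": "",
        }
    )
    argv = ["tensorboard", "--logdir", logdir, "--host", host]
    return argv, env


argv, env = build_tensorboard_command("s3://<bucket>/<folder>")
# To actually launch: subprocess.run(argv, env=env, check=True)
```

Building the pair first makes it easy to print or log exactly what will be executed before handing it to `subprocess.run`.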

This is the error output:

2023-04-28 14:37:22.039640: W tensorflow/c/logging.cc:37] Retry Strategy will use the default max attempts.
2023-04-28 14:37:22.039685: W tensorflow/c/logging.cc:37] Retry Strategy will use the default max attempts.
2023-04-28 14:37:22.039714: W tensorflow/c/logging.cc:37] Token file must be specified to use STS AssumeRole web identity creds provider.
2023-04-28 14:37:22.039730: W tensorflow/c/logging.cc:37] Retry Strategy will use the default max attempts.
2023-04-28 14:37:22.068841: E tensorflow/c/logging.cc:40] HTTP response code: 404
Resolved remote host IP address:
Request ID:
Exception name:
Error message: No response body.
5 response headers:
content-type : application/xml
date : Fri, 28 Apr 2023 14:37:21 GMT
server : AmazonS3
x-amz-id-2 : Am5XM8hPcYQIbatGgTDYxOo0yxcPBkGFh5tg5tdM1bor4zc9Yzb1jkBZ0cd0rjaJ1XXJXoHk/tY=
x-amz-request-id : RNH62MB09RNQRT3H
2023-04-28 14:37:22.068889: W tensorflow/c/logging.cc:37] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2023-04-28 14:37:22.097549: E tensorflow/c/logging.cc:40] HTTP response code: 404

I would expect TensorBoard to use the S3 filesystem from tensorflow_io (tensorflow_io/core/filesystems/s3/), but the messages above suggest that is not happening. Note that the diagnostics report shows tensorflow-io==0.26.0 and tensorflow-io-gcs-filesystem==0.26.0.

Additionally, I tried running TensorBoard from a Python script but got the same problem:

import os
import tensorflow as tf
import tensorflow_io as tfio
from tensorboard import program

os.environ["AWS_REGION"]="us-east-1"
os.environ["S3_REGION"]="us-east-1"
os.environ["S3_ENDPOINT"]="s3.us-east-1.amazonaws.com"
os.environ["S3_USE_HTTPS"]="1"
os.environ["S3_VERIFY_SSL"]="0"
os.environ["AWS_LOG_LEVEL"]="1"

tracking_address = 's3://<bucket>/<folder>' # the path of your log file.
host_ip = "0.0.0.0"

if __name__ == "__main__":
    tb = program.TensorBoard()
    tb.configure(argv=[None, '--logdir', tracking_address, '--bind_all'])
    url = tb.launch()
    print(f"TensorBoard listening on {url}")

Environment information (required)

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='ip-xxx-xx-xx-xxx', release='5.15.0-1026-aws', version='#30~20.04.2-Ubuntu SMP Fri Nov 25 14:53:22 UTC 2022', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.9.1
INFO: installed: tensorflow==2.9.2
INFO: installed: tensorflow-estimator==2.9.0
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.9.1'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.9.2'
INFO: tensorflow.__git_version__: 'v2.9.1-132-g18960c44ad3'

--- check: tensorboard_data_server_version
INFO: data server binary: '/home/ubuntu/.local/lib/python3.9/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/ubuntu/.local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'ip-xxx-xx-xx-xxx.ec2.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=144084, st_dev=66306, st_nlink=2, st_uid=1000, st_gid=1000, st_size=4096, st_atime=1682631005, st_mtime=1682691735, st_ctime=1682691735)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/ubuntu/.local/lib/python3.9/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.3.0
aiohttp==3.8.1
aiohttp-cors==0.7.0
aiosignal==1.3.1
alabaster==0.7.12
albumentations==1.2.0
alembic==1.10.3
antlr4-python3-runtime==4.9.3
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
astunparse==1.6.3
async-generator==1.10
async-timeout==4.0.2
attrs==21.4.0
Automat==0.8.0
autopage==0.5.1
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.11.1
black==22.12.0
bleach==5.0.1
blessed==1.20.0
blinker==1.4
bokeh==3.0.2
boto3==1.22.6
botocore==1.25.13
cachetools==5.2.0
certifi==2022.12.7
cffi==1.15.1
cfgv==3.3.1
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
cliff==4.2.0
cloud-init==23.1.2
cloudpickle==2.0.0
cmaes==0.9.1
cmd2==2.4.3
colorama==0.4.3
colorful==0.5.5
colorlog==6.7.0
comm==0.1.2
command-not-found==0.3
commonmark==0.9.1
configobj==5.0.6
constantly==15.1.0
contourpy==1.0.6
conversions==0.0.2
cryptography==2.8
curio==1.6
cycler==0.11.0
Cython==0.29.32
dbus-python==1.2.16
debugpy==1.6.4
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
distinctipy==1.2.2
distlib==0.3.6
distro==1.4.0
distro-info===0.23ubuntu1
dm-tree==0.1.7
docker-pycreds==0.4.0
docrepr==0.2.0
docutils==0.17.1
ec2-hibinit-agent==1.0.0
edward2==0.0.2
einops==0.4.1
entrypoints==0.3
etils==0.9.0
exceptiongroup==1.0.4
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.8.2
flatbuffers==1.12
focal-loss==0.0.7
fonttools==4.38.0
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2022.11.0
gast==0.4.0
gin-config==0.5.0
gitdb==4.0.10
GitPython==3.1.29
google-api-core==2.11.0
google-api-python-client==2.69.0
google-auth==2.15.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.57.0
gpustat==1.1
greenlet==2.0.2
grpcio==1.43.0
gviz-api==1.10.0
h5py==3.7.0
hibagent==1.0.1
html2text==2020.1.16
httplib2==0.21.0
hydra-colorlog==1.1.0
hydra-core==1.2.0
hydra-joblib-launcher==1.2.0
hydra-optuna-sweeper==1.2.0
hydra-ray-launcher==1.2.0
hyperlink==19.0.0
identify==2.5.9
idna==3.4
imageio==2.22.4
imagesize==1.4.1
importlib-metadata==4.13.0
importlib-resources==5.10.1
incremental==16.10.1
iniconfig==1.1.1
ipykernel==6.19.2
ipympl==0.9.2
ipyparallel==8.4.1
ipython==8.7.0
ipython-genutils==0.2.0
ipywidgets==8.0.4
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.1.1
jsonpatch==1.22
jsonpointer==2.0
jsonschema==4.17.3
jupyter-events==0.5.0
jupyter_client==7.4.8
jupyter_core==5.1.0
jupyter_server==2.0.5
jupyter_server_terminals==0.4.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.5
kaggle==1.5.12
keras==2.9.0
Keras-Preprocessing==1.1.2
keyring==18.0.1
kiwisolver==1.4.4
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
libclang==14.0.6
llvmlite==0.39.1
lxml==4.9.1
Mako==1.2.4
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.1
matplotlib-inline==0.1.6
mistune==2.0.4
more-itertools==4.2.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==0.4.3
nbclassic==0.4.8
nbclient==0.7.2
nbconvert==7.2.7
nbformat==5.7.1
nest-asyncio==1.5.6
netifaces==0.10.4
networkx==2.8.8
nmslib==2.1.1
nodeenv==1.7.0
notebook==6.5.2
notebook_shim==0.2.2
numba==0.56.4
numpy==1.21.5
nvidia-ml-py==11.525.112
oauth2client==4.1.3
oauthlib==3.2.2
omegaconf==2.2.2
opencensus==0.11.2
opencensus-context==0.1.3
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
opt-einsum==3.3.0
optuna==2.10.1
outcome==1.2.0
packaging==22.0
pandas==1.4.3
pandocfilters==1.5.0
parso==0.8.3
pathos==0.3.0
pathspec==0.10.3
pathtools==0.1.2
pbr==5.11.1
pexpect==4.6.0
pickle5==0.0.11
pickleshare==0.7.5
Pillow==9.3.0
pip==23.1.2
platformdirs==2.6.0
pluggy==1.0.0
portalocker==2.6.0
pox==0.3.2
ppft==1.7.6.6
pre-commit==2.20.0
prettytable==3.7.0
prometheus-client==0.13.1
promise==2.3
prompt-toolkit==3.0.36
protobuf==3.19.6
protobuf3-to-dict==0.1.5
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
py-cpuinfo==9.0.0
py-spy==0.3.14
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.6.1
pycocotools==2.0.6
pycparser==2.21
pydash==5.1.2
Pygments==2.13.0
PyGObject==3.36.0
pygwalker==0.1.4
PyHamcrest==1.9.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pynndescent==0.5.8
PyOpenGL==3.1.6
pyOpenSSL==19.0.0
pyparsing==3.0.9
pyperclip==1.8.2
PyQt5==5.14.1
PyQt6==6.4.0
PyQt6-Qt6==6.4.1
PyQt6-sip==13.4.0
pyqtgraph==0.13.1
pyrsistent==0.15.5
pyserial==3.4
pytest==6.2.5
pytest-asyncio==0.20.3
python-apt==2.0.0+ubuntu0.20.4.8
python-dateutil==2.8.2
python-debian===0.1.36ubuntu1
python-json-logger==2.0.4
python-slugify==7.0.0
python-version==0.0.2
pytz==2022.6
PyWavelets==1.4.1
PyYAML==5.4.1
pyzmq==24.0.1
qtconsole==5.4.0
QtPy==2.3.0
qudida==0.0.4
ray==1.12.0
recommonmark==0.7.1
regex==2022.10.31
requests==2.28.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==12.6.0
rsa==4.9
s3fs==0.4.2
s3transfer==0.5.2
sacrebleu==2.3.1
sagemaker==2.109.0
scikit-image==0.18.3
scikit-learn==1.1.1
scipy==1.7.3
seaborn==0.12.1
SecretStorage==2.3.1
Send2Trash==1.8.0
sentencepiece==0.1.97
sentry-sdk==1.11.1
seqeval==1.2.2
service-identity==18.1.0
setproctitle==1.3.2
setuptools==61.2.0
shortuuid==1.0.11
simplejson==3.16.0
sip==4.19.21
six==1.16.0
smart-open==6.3.0
smdebug-rulesconfig==1.0.1
smmap==5.0.0
sniffio==1.3.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
sos==4.4
soupsieve==2.3.2.post1
Sphinx==5.3.0
sphinx-markdown-builder==0.5.5
sphinx-rtd-theme==1.1.1
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
SQLAlchemy==2.0.9
ssh-import-id==5.10
stack-data==0.6.2
stevedore==5.0.0
systemd-python==234
tabulate==0.9.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.11.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.2
tensorflow-addons==0.17.1
tensorflow-datasets==4.7.0
tensorflow-decision-forests==0.2.7
tensorflow-estimator==2.9.0
tensorflow-hub==0.12.0
tensorflow-io==0.26.0
tensorflow-io-gcs-filesystem==0.26.0
tensorflow-metadata==1.12.0
tensorflow-model-optimization==0.7.3
tensorflow-probability==0.17.0
tensorflow-similarity==0.16.8
tensorflow-text==2.9.0
termcolor==2.1.1
terminado==0.17.1
testpath==0.6.0
text-unidecode==1.3
tf-models-official==2.9.2
tf-slim==1.1.0
threadpoolctl==3.1.0
tifffile==2022.10.10
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.2
tqdm==4.64.1
traitlets==5.7.0
trio==0.22.0
Twisted==18.9.0
typeguard==2.13.3
typing_extensions==4.4.0
ubuntu-advantage-tools==27.12
ufw==0.36
umap-learn==0.5.3
unattended-upgrades==0.1
unify==0.5
untokenize==0.1.1
uri-template==1.2.0
uritemplate==4.1.1
urllib3==1.26.13
validators==0.20.0
virtualenv==20.17.1
vit-keras==0.1.0
wadllib==1.3.3
wandb==0.12.18
wcwidth==0.2.5
webcolors==1.12
webencodings==0.5.1
websocket-client==1.4.2
Werkzeug==2.2.2
wheel==0.38.4
widgetsnbextension==4.0.5
wrapt==1.14.1
wurlitzer==3.0.3
xyzservices==2022.9.0
yapf==0.32.0
yarl==1.8.2
zipp==3.11.0
zope.interface==4.7.1

Krasner (Apr 28 '23 15:04)

As expected, the problem is that tensorflow_io is not being used. I propose a few changes:

  1. Imports in backend/event_processing/io_wrapper.py:
import tensorflow as tf
import tensorflow_io as tfio
import s3fs

Note the import of s3fs: tf.io.gfile.glob is VERY slow when recursing through an AWS S3 path.

  2. Walk through the S3 path:
def S3ListRecursivelyViaWalking(top):
    # s3fs yields directory paths without the scheme, so "s3://" is added back.
    s3 = s3fs.S3FileSystem()
    for dir_path, _, filenames in s3.walk(top, topdown=True, refresh=True):
        yield (
            "s3://" + dir_path,
            (os.path.join("s3://" + dir_path, filename) for filename in filenames),
        )

  3. Use the above method to index the S3 path:
if io_util.IsCloudPath(path):
    # Glob-ing for files can be significantly faster than recursively
    # walking through directories for some file systems.
    logger.info(
        "GetLogdirSubdirectories: Starting to list directories via glob-ing."
    )
    if io_util.IsS3Path(path):
        traversal_method = S3ListRecursivelyViaWalking
    else:
        traversal_method = ListRecursivelyViaGlobbing

  4. Add an io_util.IsS3Path function in util/io_util.py:
def IsS3Path(path):
    return path.startswith("s3://")
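
The proposed traversal can be tried offline by injecting the walker instead of calling S3. This is a sketch under stated assumptions: `is_s3_path`, `list_recursively_via_walking`, and `fake_walk` are illustrative names, not TensorBoard functions, and `fake_walk` stands in for `s3fs.S3FileSystem().walk`, which is assumed here to yield paths without the `s3://` scheme. The sketch uses `posixpath.join` so the separator stays `/` regardless of the host OS.

```python
import posixpath


def is_s3_path(path):
    """Return True for paths using the s3:// scheme."""
    return path.startswith("s3://")


def list_recursively_via_walking(top, walk):
    """Yield (directory, file paths) pairs with the s3:// prefix restored.

    `walk` is any callable that yields (dir_path, dirnames, filenames)
    tuples whose dir_path has no scheme (an assumption of this sketch).
    """
    for dir_path, _, filenames in walk(top):
        prefix = "s3://" + dir_path
        yield prefix, [posixpath.join(prefix, name) for name in filenames]


def fake_walk(top):
    """Stand-in for an S3 listing so the sketch runs without network access."""
    yield "bucket/logs", ["run1"], ["events.out.tfevents.1"]
    yield "bucket/logs/run1", [], ["events.out.tfevents.2"]


listing = list(list_recursively_via_walking("s3://bucket/logs", fake_walk))
```

Passing the walker as an argument keeps the path-rebuilding logic testable without credentials. In real use the walker would be the bound `walk` method of an `s3fs.S3FileSystem` instance.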

Thoughts?

Krasner (Apr 28 '23 17:04)

Hi @Krasner,

We added S3 support in https://github.com/tensorflow/tensorboard/pull/5491 (since TensorBoard v2.6). If S3 directory parsing had failed because tensorflow-io was not found, the error message would be something like Error: Unsupported filename scheme S3... (e.g. https://github.com/tensorflow/tensorboard/issues/5480), and it would prompt you to install TF I/O. I can see from the diagnostics output that the TF I/O dependency exists in your environment, so I'm not sure this is an issue with identifying and parsing S3 files.

The error messages Error message: No response body and If the signature check failed. This could be because of a time skew. Attempting to adjust the signer look like a permission or configuration issue related to S3. I'm not familiar with AWS. Is it possible to adjust AWS_LOG_LEVEL (or maybe another argument) to get more information about the failure?

yatbear (Apr 28 '23 20:04)

@yatbear I don't think it's a permission issue. As I noted above, I can access AWS S3 from my EC2 instance, and if I import tensorflow_io in my script then I can also access S3 files with tf.io.gfile. However, without the explicit import of this library, tf.io.gfile fails.
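
The explicit-import behaviour described above can be made visible with a guarded import that runs before any S3 access. This is a minimal sketch. `try_register_filesystems` is a hypothetical helper name, and the idea that importing tensorflow_io is what makes `s3://` readable comes from the observation in this thread rather than from a documented guarantee.

```python
import importlib


def try_register_filesystems(module_name="tensorflow_io"):
    """Import a plugin module for its side effects; return True on success."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Usage (sketch): call once at startup, before any tf.io.gfile access
# to s3:// paths, and fail early with a clear message otherwise:
#
#     if not try_register_filesystems():
#         raise SystemExit("tensorflow-io is not importable")
```

Returning a boolean instead of raising lets the caller decide whether a missing plugin is fatal or just means S3 logdirs are unavailable.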

Interestingly, after the fixes above the error messages are still visible with AWS_LOG_LEVEL=1 but tensorboard is able to access event files on s3.

Additionally, as I mentioned, tf.io.gfile is very slow compared to s3fs for accessing S3 files.

Krasner (Apr 28 '23 21:04)

@Krasner, thanks for the clarification and the proposed solutions above! I just saw this open issue in the tensorflow-io repo: https://github.com/tensorflow/io/issues/1731, which suggests the problem lies here. A temporary workaround mentioned in https://github.com/tensorflow/io/issues/1731#issuecomment-1332779337 is to pin the tensorflow-io dependency to 0.27.0. Could you try this? In the meantime, I will investigate a bit more before adding the new s3fs dependency.
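
Before and after trying the pin, it can help to confirm which tensorflow-io version is actually installed in the environment TensorBoard runs from. This sketch uses only the standard library. `installed_version` and `matches_pin` are hypothetical helper names, and 0.27.0 is the version from the workaround mentioned above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(distribution, pinned):
    """Return True only if the distribution is installed at exactly `pinned`."""
    return installed_version(distribution) == pinned


# True only when tensorflow-io is installed at the pinned version.
tfio_is_pinned = matches_pin("tensorflow-io", "0.27.0")
```

Running this with the same interpreter that launches TensorBoard avoids being misled by a second Python installation with different packages.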

yatbear (Apr 28 '23 22:04)

I saw this recent fix related to S3: https://github.com/tensorflow/io/pull/1790, but it is not included in the latest tensorflow-io pip release (https://pypi.org/project/tensorflow-io/#history), and their nightly build is also stale. I left a comment on the aforementioned PR.

yatbear (May 15 '23 16:05)

I get the same error when using tensorboard --logdir s3://zenml-minio-store/logs/... The versions I used were tensorflow=2.8.0, tensorboard=2.8.0, tensorflow-io=0.24.0. I have tried updating to tensorflow=2.12.0, tensorboard=2.12.3, tensorflow-io=0.33.0, but it doesn't work.

ngohoanganh96 (Aug 30 '23 09:08)