lingvo/jax:main fails with "undefined symbol: _ZNK10tensorflow6Status14GetAllPayloadsEv"
To reproduce:
# bazel run -c opt lingvo/jax:main -- --model=lm.ptb.PTBCharTransformerSmallSgd --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
...
2022-03-24 17:46:18.859227: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py", line 22, in <module>
from lingvo.core.ops import gen_x_ops # pylint: disable=g-import-not-at-top
ImportError: cannot import name 'gen_x_ops' from 'lingvo.core.ops' (/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/main.py", line 36, in <module>
from lingvo.jax import eval as eval_lib
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/eval.py", line 29, in <module>
from lingvo.jax import base_input
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/base_input.py", line 23, in <module>
from lingvo.core import datasource
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/datasource.py", line 33, in <module>
from lingvo.core import base_layer
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/base_layer.py", line 27, in <module>
from lingvo.core import py_utils
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/py_utils.py", line 43, in <module>
from lingvo.core import ops
File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py", line 25, in <module>
tf.resource_loader.get_path_to_datafile('x_ops.so'))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/x_ops.so: undefined symbol: _ZNK10tensorflow6Status14GetAllPayloadsEv
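For readers unfamiliar with mangled names, the missing symbol in the last line can be decoded by hand. The snippet below is only a minimal sketch: `demangle_simple_nested` is a hypothetical helper (not part of lingvo or TensorFlow) that handles just plain nested names with an empty parameter list, which happens to cover this symbol. In practice `c++filt` does this properly.

```python
import re


def demangle_simple_nested(symbol):
    """Decode Itanium-ABI names of the form _ZN[K]<len><name>...Ev.

    Hypothetical helper for illustration: no templates, no substitutions,
    and only an empty (void) parameter list are supported.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    # A 'K' right after 'N' marks a const member function.
    is_const = symbol[pos:pos + 1] == "K"
    if is_const:
        pos += 1
    parts = []
    # Each component is encoded as <decimal length><identifier>.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(match.group())
        pos += len(match.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing 'E' terminator")
    # Everything after 'E' encodes the parameters; 'v' means none.
    if symbol[pos + 1:] != "v":
        raise ValueError("only an empty parameter list is supported")
    return "::".join(parts) + "()" + (" const" if is_const else "")


print(demangle_simple_nested("_ZNK10tensorflow6Status14GetAllPayloadsEv"))
# -> tensorflow::Status::GetAllPayloads() const
```

So `x_ops.so` expects the TensorFlow runtime to export `tensorflow::Status::GetAllPayloads() const`, and the runtime loaded here does not. That usually points to a mismatch between the TensorFlow headers the op library was built against and the TensorFlow shared library found at load time.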
I'm unable to reproduce internally :( I'm wondering if the external docker image is somehow different. I will try to reproduce starting from the OSS version.
Could you confirm which model name you used? (Probably not lm.ptb.PTBCharTransformerSmallSgd.)
Thanks for looking into this, Laurent! It's actually lm.ptb.PTBCharTransformerSmallSgd:
% bazel run -c opt lingvo/jax:main -- \
--model=lm.ptb.PTBCharTransformerSmallSgd \
--job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
I noticed that lingvo/jax/pip_package/build.Dockerfile does not specify dependency versions explicitly, so maybe we are using different versions of TF?
I see:
tensorflow 2.8.0
tensorflow-datasets 4.5.2
tensorflow-hub 0.12.0
tensorflow-io-gcs-filesystem 0.24.0
tensorflow-metadata 1.7.0
tensorflow-text 2.8.1
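In case it helps others compare environments, a list like the one above can be produced from inside the interpreter that bazel actually uses. This is only a sketch: `installed_versions` is a hypothetical helper name, and it relies on the standard-library `importlib.metadata`, which needs Python 3.8+ (on 3.7 the `importlib_metadata` backport would have to stand in for it).

```python
from importlib import metadata


def installed_versions(prefix="tensorflow"):
    """Return {distribution name: version} for names starting with prefix."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with broken or missing metadata.
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return dict(sorted(found.items()))


for name, version in installed_versions().items():
    print(f"{name:32s} {version}")
```

Note that identical version strings only show that the same pip packages are installed; they do not by themselves show which headers the custom ops were compiled against.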
My TF versions for python3.7 are exactly the same as yours.
Separately, and this is definitely not the main issue, but just for the record: we didn't open-source the PTB configs such as lm.ptb.PTBCharTransformerSmallSgd, so you may want to try e.g. lm.lm_cloud.LmCloudSpmdTest instead.