lingvo/jax:main fails with "undefined symbol: _ZNK10tensorflow6Status14GetAllPayloadsEv"

Open ruomingp opened this issue 3 years ago • 5 comments

To reproduce:

# bazel run -c opt     lingvo/jax:main --     --model=lm.ptb.PTBCharTransformerSmallSgd     --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
...
2022-03-24 17:46:18.859227: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py", line 22, in <module>
    from lingvo.core.ops import gen_x_ops  # pylint: disable=g-import-not-at-top
ImportError: cannot import name 'gen_x_ops' from 'lingvo.core.ops' (/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/main.py", line 36, in <module>
    from lingvo.jax import eval as eval_lib
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/eval.py", line 29, in <module>
    from lingvo.jax import base_input
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/base_input.py", line 23, in <module>
    from lingvo.core import datasource
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/datasource.py", line 33, in <module>
    from lingvo.core import base_layer
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/base_layer.py", line 27, in <module>
    from lingvo.core import py_utils
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/py_utils.py", line 43, in <module>
    from lingvo.core import ops
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py", line 25, in <module>
    tf.resource_loader.get_path_to_datafile('x_ops.so'))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/x_ops.so: undefined symbol: _ZNK10tensorflow6Status14GetAllPayloadsEv
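
For reference, the missing symbol is an Itanium C++ ABI mangled name that decodes to `tensorflow::Status::GetAllPayloads() const`. An undefined-symbol error like this usually suggests that x_ops.so was compiled against TensorFlow headers declaring that method, while the TensorFlow library loaded at runtime does not export it. The helper below is a hypothetical, minimal sketch (not part of lingvo or TensorFlow) that decodes only the simple `_ZN[K]<len><name>...Ev` form seen here; `c++filt` is the proper tool for anything more complex.

```python
def demangle_nested_name(symbol):
    """Decode a simple Itanium-mangled nested name with no arguments.

    Handles only `_ZN[K]<len><id>...<len><id>Ev`; returns None for anything else.
    This is an illustrative sketch, not a full demangler.
    """
    if not symbol.startswith("_ZN"):
        return None
    pos = 3
    # An optional 'K' marks a const member function.
    is_const = symbol.startswith("K", pos)
    if is_const:
        pos += 1
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    # 'E' closes the nested name; 'v' means an empty (void) parameter list.
    if not parts or not symbol.startswith("E", pos):
        return None
    if symbol[pos + 1:] != "v":
        return None
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_nested_name("_ZNK10tensorflow6Status14GetAllPayloadsEv"))
```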

ruomingp avatar Mar 24 '22 18:03 ruomingp

I'm unable to reproduce internally :( I'm wondering if the external docker image is somehow different. I will try to reproduce starting from the OSS version.

laurentes avatar Mar 24 '22 21:03 laurentes

Could you confirm which model name you used? (Probably not lm.ptb.PTBCharTransformerSmallSgd.)

laurentes avatar Mar 24 '22 23:03 laurentes

Thanks for looking into this, Laurent! It's actually lm.ptb.PTBCharTransformerSmallSgd:

% bazel run -c opt     lingvo/jax:main --  \
   --model=lm.ptb.PTBCharTransformerSmallSgd  \
   --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr

ruomingp avatar Mar 25 '22 00:03 ruomingp

I noticed that lingvo/jax/pip_package/build.Dockerfile does not specify dependency versions explicitly, so maybe we are using different versions of TF?

I see:

tensorflow                        2.8.0
tensorflow-datasets               4.5.2
tensorflow-hub                    0.12.0
tensorflow-io-gcs-filesystem      0.24.0
tensorflow-metadata               1.7.0
tensorflow-text                   2.8.1
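
A listing like the one above can also be produced from inside the Python environment that actually runs the binary, which makes it easier to compare two setups. This is a minimal sketch: `installed_versions` is a hypothetical helper name, and it assumes `importlib.metadata` (Python 3.8+) or, on Python 3.7, the `importlib_metadata` backport.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7: assumes the importlib_metadata backport is installed
    import importlib_metadata as metadata


def installed_versions(prefix="tensorflow", distributions=None):
    """Return {name: version} for distributions whose name starts with `prefix`.

    `distributions` is an optional iterable for testing; by default the
    distributions visible to the current interpreter are scanned.
    """
    dists = metadata.distributions() if distributions is None else distributions
    found = {}
    for dist in dists:
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<34}{version}")
```

Running it in both environments and diffing the output shows whether the Python packages differ. It does not show which TensorFlow headers x_ops.so was compiled against.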

ruomingp avatar Mar 25 '22 02:03 ruomingp

My TF versions for python3.7 are exactly the same as yours.

Separately, and this is definitely not the main issue, but just for the record: we didn't open-source the PTB configs such as lm.ptb.PTBCharTransformerSmallSgd, so you may want to try e.g. lm.lm_cloud.LmCloudSpmdTest instead.

laurentes avatar Mar 25 '22 03:03 laurentes