Target //lingvo/jax:main failed to build

Open ruomingp opened this issue 2 years ago • 9 comments

To reproduce:

% docker build --tag tensorflow:lingvo - < "$LINGVO_DIR/lingvo/jax/pip_package/build.Dockerfile"
...
% docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash
#
# bazel run -c opt \
>     lingvo/jax:main -- \
>     --model=lm.ptb.PTBCharTransformerSmallSgd \
>     --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: Rule 'subpar' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "35bb9f0092f71ea56b742a520602da9b3638a24f", shallow_since = "1557863961 -0400" and dropping ["tag"]
DEBUG: Repository subpar instantiated at:
  /tmp/lingvo/WORKSPACE:12:15: in <toplevel>
Repository rule git_repository defined at:
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/bazel_tools/tools/build_defs/repo/git.bzl:199:33: in <toplevel>
INFO: Analyzed target //lingvo/jax:main (36 packages loaded, 6669 targets configured).
INFO: Found 1 target...
INFO: From Compiling icu4c/source/common/unistr.cpp:
external/icu/icu4c/source/common/unistr.cpp:1975:13: warning: 'void uprv_UnicodeStringDummy()' defined but not used [-Wunused-function]
 static void uprv_UnicodeStringDummy(void) {
             ^
INFO: From Compiling icu4c/source/common/ucptrie.cpp:
external/icu/icu4c/source/common/ucptrie.cpp: In function 'UChar32 {anonymous}::getRange(const void*, UChar32, uint32_t (*)(const void*, uint32_t), const void*, uint32_t*)':
external/icu/icu4c/source/common/ucptrie.cpp:404:5: warning: 'value' may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (maybeFilterValue(highValue, trie->nullValue, nullValue,
     ^
INFO: From Compiling lingvo/core/ops/record_yielder.cc:
lingvo/core/ops/record_yielder.cc:347:6: warning: 'tensorflow::lingvo::{anonymous}::register_text_iterator' defined but not used [-Wunused-variable]
 bool register_text_iterator = RecordIterator::Register(
      ^
lingvo/core/ops/record_yielder.cc:356:6: warning: 'tensorflow::lingvo::{anonymous}::register_indirect_text_iterator' defined but not used [-Wunused-variable]
 bool register_indirect_text_iterator =
      ^
lingvo/core/ops/record_yielder.cc:366:6: warning: 'tensorflow::lingvo::{anonymous}::register_tf_record_iterator' defined but not used [-Wunused-variable]
 bool register_tf_record_iterator =
      ^
lingvo/core/ops/record_yielder.cc:371:6: warning: 'tensorflow::lingvo::{anonymous}::register_tf_record_gzip_iterator' defined but not used [-Wunused-variable]
 bool register_tf_record_gzip_iterator =
      ^
lingvo/core/ops/record_yielder.cc:376:6: warning: 'tensorflow::lingvo::{anonymous}::register_iota_iterator' defined but not used [-Wunused-variable]
 bool register_iota_iterator = RecordIterator::RegisterWithPatternParser(
      ^
ERROR: /tmp/lingvo/lingvo/core/ops/BUILD:180:18: Compiling lingvo/core/ops/input_common.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 64 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 64 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
In file included from lingvo/core/ops/input_common.cc:16:0:
./lingvo/core/ops/input_common.h:143:55: error: expected class-name before '{' token
 class InputResource : public tensorflow::ResourceBase {
                                                       ^
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2Create<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:228:9: error: 'MakeRefCountingHandle' is not a member of 'tensorflow::ResourceHandle'
         ResourceHandle::MakeRefCountingHandle(resource, ctx->device()->name(),
         ^
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2GetNext<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:252:28: error: 'const class tensorflow::ResourceHandle' has no member named 'GetResource'
     auto statusor = handle.GetResource<resource_type>();
                            ^
./lingvo/core/ops/input_common.h:252:53: error: expected primary-expression before '>' token
     auto statusor = handle.GetResource<resource_type>();
                                                     ^
./lingvo/core/ops/input_common.h:252:55: error: expected primary-expression before ')' token
     auto statusor = handle.GetResource<resource_type>();
                                                       ^
Target //lingvo/jax:main failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 257.942s, Critical Path: 137.19s
INFO: 147 processes: 22 internal, 125 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

ruomingp, Mar 19 '22 00:03

Are you able to build using the provided build.sh script? https://github.com/tensorflow/lingvo/blob/master/lingvo/jax/pip_package/build.sh

That script sets several environment variables and flags, and it is hard for me to tell which of them you would need to fix your issue.

laurentes, Mar 19 '22 02:03

Never mind, I was able to reproduce your issue after modifying the script.

laurentes, Mar 19 '22 02:03

Heads up that I have a fix (pending review) that will hopefully land Tuesday morning PDT.

laurentes, Mar 22 '22 06:03

This should be fixed if you sync after this commit: https://github.com/tensorflow/lingvo/commit/fe60d03c9716d1e0c87462b1e92c5ee9b39f0b87

Also make sure to update your docker build with the latest optax-shampoo (v0.0.5).
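
One way to confirm that a local checkout already contains the fix commit linked above is to ask git whether that commit is an ancestor of HEAD. The following is only a sketch: it assumes git is on PATH and that it is run from inside the lingvo checkout, and the helper name `checkout_contains` is hypothetical.

```python
import shutil
import subprocess

# Commit hash taken from the link in this comment.
FIX_COMMIT = "fe60d03c9716d1e0c87462b1e92c5ee9b39f0b87"


def checkout_contains(commit, repo_dir="."):
    """Return True if `commit` is reachable from HEAD in repo_dir, else False."""
    if shutil.which("git") is None:
        raise RuntimeError("git not found on PATH")
    # `git merge-base --is-ancestor A B` exits with 0 when A is an ancestor
    # of B; any other exit code (not an ancestor, unknown commit, not a
    # repository) is treated as "not contained" here.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", commit, "HEAD"],
        cwd=repo_dir,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("fix commit present:", checkout_contains(FIX_COMMIT))
```

If this prints `False`, the checkout has probably not been synced past the fix yet.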

laurentes, Mar 22 '22 17:03

Thank you so much, Laurent. Let me try it.

ruomingp, Mar 23 '22 01:03

Strangely, I am now running into the "No module named 'clu'" error again:

# bazel run -c opt \
>     lingvo/jax:main -- \
>     --model=lm.ptb.PTBCharTransformerSmallSgd \
>     --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
Extracting Bazel installation...
...
INFO: Build completed successfully, 239 total actions
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/main.py", line 34, in <module>
    from clu import platform
ModuleNotFoundError: No module named 'clu'

even though it's installed according to pip list:

root@5c3049184a19:/tmp/lingvo# pip list
Package                           Version
--------------------------------- -------------------
absl-py                           1.0.0
...
clu                               0.0.6
...

ruomingp, Mar 24 '22 15:03

Before this run, Docker was running out of disk space, so I ran:

docker system prune --all

ruomingp, Mar 24 '22 15:03

I think I know the issue. I should have warned you about this.

Could you try to just run python3 and check the default version / import clu?

My intuition is that the default will be python3.6, which is unsupported / doesn't come with the right dependencies.

All you have to do is set another Python version as the default, e.g. using:

update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 1

and then re-run your bazel command.
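
The check suggested above (which version `python3` resolves to, and whether `clu` imports under it) can be scripted roughly as follows. This is a minimal sketch, not part of lingvo; the function name `describe_interpreter` is made up for illustration, and it has to be run with plain `python3` so that it reports on the default interpreter.

```python
import importlib.util
import sys


def describe_interpreter(module_name="clu"):
    """Return (version, executable path, whether module_name is importable)."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    # find_spec returns None when the top-level module cannot be found on
    # this interpreter's import path.
    found = importlib.util.find_spec(module_name) is not None
    return version, sys.executable, found


if __name__ == "__main__":
    version, executable, found = describe_interpreter()
    print(f"python3 resolves to {executable} (Python {version})")
    print(f"clu importable: {found}")
    if sys.version_info[:2] == (3, 6):
        print("Default interpreter is 3.6; consider switching the default.")
```

A package can show up in `pip list` and still fail to import here if `pip` and `python3` belong to different Python installations.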

laurentes, Mar 24 '22 17:03

Thanks!

Indeed, it's 3.6, and update-alternatives solves the problem. Can we update the Dockerfile to avoid 3.6?

# python3 --version
Python 3.6.13

After that, I'm running into another issue. Let me file a separate issue.

ruomingp, Mar 24 '22 17:03