
PyTorch/XLA `gtest` needs to be updated

Open miladm opened this issue 2 years ago • 17 comments

I am running into the following error when building PyTorch/XLA CPP tests using `python setup.py install`.

Error report from: xla/test/cpp/build/gtest/src/googletest-stamp/googletest-build-err.log

In file included from xla/test/cpp/build/gtest/src/googletest-src/googletest/src/gtest-all.cc:42:
xla/test/cpp/build/gtest/src/googletest-src/googletest/src/gtest-death-test.cc: In function 'bool testing::internal::StackGrowsDown()':
xla/test/cpp/build/gtest/src/googletest-src/googletest/src/gtest-death-test.cc:1301:24: error: 'dummy' may be used uninitialized [-Werror=maybe-uninitialized]
 1301 |   StackLowerThanAddress(&dummy, &result);
      |   ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
xla/test/cpp/build/gtest/src/googletest-src/googletest/src/gtest-death-test.cc:1290:13: note: by argument 1 of type 'const void*' to 'void testing::internal::StackLowerThanAddress(const void*, bool*)' declared here
 1290 | static void StackLowerThanAddress(const void* ptr, bool* result) {
      |             ^~~~~~~~~~~~~~~~~~~~~
xla/test/cpp/build/gtest/src/googletest-src/googletest/src/gtest-death-test.cc:1299:7: note: 'dummy' declared here
 1299 |   int dummy;
      |       ^~~~~
cc1plus: all warnings being treated as errors
make[5]: *** [googletest/CMakeFiles/gtest.dir/build.make:76: googletest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[4]: *** [CMakeFiles/Makefile2:172: googletest/CMakeFiles/gtest.dir/all] Error 2
make[3]: *** [Makefile:146: all] Error 2

Please investigate.
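For readers hitting the same log: the diagnostic is GCC's `-Wmaybe-uninitialized`, promoted to an error by `-Werror`, and it fires because the address of the uninitialized local `dummy` is passed to a `const void*` parameter. A minimal sketch of that pattern with the local initialized is below. The `...Sketch` names are hypothetical and this is not the googletest source; it only illustrates the kind of change that silences the warning.

```cpp
#include <functional>

// Hypothetical helper (not googletest code): reports whether a local in this
// frame sits at a lower address than `ptr`, using std::less for a
// well-defined pointer ordering.
static void StackLowerThanAddressSketch(const void* ptr, bool* result) {
  int local = 0;
  *result = std::less<const void*>()(&local, ptr);
}

// Hypothetical caller: `dummy` is initialized, so no uninitialized object is
// handed to the const-pointer parameter that the warning points at.
static bool StackGrowsDownSketch() {
  int dummy = 0;
  bool result = false;
  StackLowerThanAddressSketch(&dummy, &result);
  return result;
}
```

The returned value depends on the platform's stack layout, so no particular result is assumed here.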


Blocker:

  • [ ] https://github.com/tensorflow/tensorflow/issues/56021

miladm avatar May 19 '22 22:05 miladm

Turns out this problem has existed with the current pin (i.e. commit 6f5fd0d7199b9a19faa in googletest version 1.10) for quite some time (ref). The solution is to upgrade to version 1.11.0. Doing it now.

miladm avatar May 19 '22 22:05 miladm

I tried a few pins from version 1.11.0, including e2239ee6043f73722e7aa812a459f54a28552929 and 4679637f1c9d5a0728bdc347a531737fad0b1ca3. None of them gave me a successful build.

Observations:

  • The initial error that prompted this issue is indeed fixed in version 1.11.
  • I now run into a new set of errors. A snippet of them is listed below.
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:454:5: error: variable of non-literal type '::tensorflow::internal::CheckOpString' cannot be defined in a constexpr function
    DCHECK_GE(width, 0) << "Unsupported width " << width;
    ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:472:31: note: expanded from macro 'DCHECK_GE'
#define DCHECK_GE(val1, val2) CHECK_GE(val1, val2)
                              ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:459:30: note: expanded from macro 'CHECK_GE'
#define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
                             ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:452:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:445:48: note: expanded from macro 'CHECK_OP_LOG'
  while (::tensorflow::internal::CheckOpString _result{        \
                                               ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:306:8: note: 'CheckOpString' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
struct CheckOpString {
       ^
In file included from xla/test/cpp/torch_xla_test.cpp:10:
In file included from xla/torch_xla/csrc/helpers.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_builder.h:31:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:495:5: error: variable of non-literal type '::tensorflow::internal::CheckOpString' cannot be defined in a constexpr function
    DCHECK_GE(exponent, 0);
    ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:472:31: note: expanded from macro 'DCHECK_GE'
#define DCHECK_GE(val1, val2) CHECK_GE(val1, val2)
                              ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:459:30: note: expanded from macro 'CHECK_GE'
#define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
                             ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:452:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:445:48: note: expanded from macro 'CHECK_OP_LOG'
  while (::tensorflow::internal::CheckOpString _result{        \
                                               ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:306:8: note: 'CheckOpString' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
struct CheckOpString {
       ^
In file included from xla/test/cpp/torch_xla_test.cpp:10:
In file included from xla/torch_xla/csrc/helpers.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_builder.h:31:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:568:12: error: no matching function for call to 'LsbMask'
    return LsbMask<uint64_t>(bits);
           ^~~~~~~~~~~~~~~~~
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:448:20: note: candidate template ignored: substitution failure [with T = unsigned long]
constexpr inline T LsbMask(int width)
                   ^
In file included from xla/test/cpp/test_op_by_op_executor.cpp:3:
In file included from xla/test/cpp/cpp_test_util.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/computation_client.h:13:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:454:5: error: variable of non-literal type '::tensorflow::internal::CheckOpString' cannot be defined in a constexpr function
    DCHECK_GE(width, 0) << "Unsupported width " << width;
    ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:472:31: note: expanded from macro 'DCHECK_GE'
#define DCHECK_GE(val1, val2) CHECK_GE(val1, val2)
                              ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:459:30: note: expanded from macro 'CHECK_GE'
#define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
                             ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:452:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:445:48: note: expanded from macro 'CHECK_OP_LOG'
  while (::tensorflow::internal::CheckOpString _result{        \
                                               ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:306:8: note: 'CheckOpString' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
struct CheckOpString {
       ^
In file included from xla/test/cpp/test_op_by_op_executor.cpp:3:
In file included from xla/test/cpp/cpp_test_util.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/computation_client.h:13:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:495:5: error: variable of non-literal type '::tensorflow::internal::CheckOpString' cannot be defined in a constexpr function
    DCHECK_GE(exponent, 0);
    ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:472:31: note: expanded from macro 'DCHECK_GE'
#define DCHECK_GE(val1, val2) CHECK_GE(val1, val2)
                              ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:459:30: note: expanded from macro 'CHECK_GE'
#define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
                             ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:452:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:445:48: note: expanded from macro 'CHECK_OP_LOG'
  while (::tensorflow::internal::CheckOpString _result{        \
                                               ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:306:8: note: 'CheckOpString' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
struct CheckOpString {
       ^
In file included from xla/test/cpp/test_op_by_op_executor.cpp:3:
In file included from xla/test/cpp/cpp_test_util.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/computation_client.h:13:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:568:12: error: no matching function for call to 'LsbMask'
    return LsbMask<uint64_t>(bits);
           ^~~~~~~~~~~~~~~~~
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:448:20: note: candidate template ignored: substitution failure [with T = unsigned long]
constexpr inline T LsbMask(int width)
                   ^
In file included from xla/test/cpp/test_tensor.cpp:7:
In file included from xla/test/cpp/cpp_test_util.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/computation_client.h:13:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:454:5: error: variable of non-literal type '::tensorflow::internal::CheckOpString' cannot be defined in a constexpr function
    DCHECK_GE(width, 0) << "Unsupported width " << width;
    ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:472:31: note: expanded from macro 'DCHECK_GE'
#define DCHECK_GE(val1, val2) CHECK_GE(val1, val2)
                              ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:459:30: note: expanded from macro 'CHECK_GE'
#define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
                             ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:452:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:445:48: note: expanded from macro 'CHECK_OP_LOG'
  while (::tensorflow::internal::CheckOpString _result{        \
                                               ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:306:8: note: 'CheckOpString' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
struct CheckOpString {
       ^
In file included from xla/test/cpp/test_tensor.cpp:7:
In file included from xla/test/cpp/cpp_test_util.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/computation_client.h:13:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:495:5: error: variable of non-literal type '::tensorflow::internal::CheckOpString' cannot be defined in a constexpr function
    DCHECK_GE(exponent, 0);
    ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:472:31: note: expanded from macro 'DCHECK_GE'
#define DCHECK_GE(val1, val2) CHECK_GE(val1, val2)
                              ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:459:30: note: expanded from macro 'CHECK_GE'
#define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
                             ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:452:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:445:48: note: expanded from macro 'CHECK_OP_LOG'
  while (::tensorflow::internal::CheckOpString _result{        \
                                               ^
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/core/platform/default/logging.h:306:8: note: 'CheckOpString' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
struct CheckOpString {
       ^
In file included from xla/test/cpp/test_tensor.cpp:7:
In file included from xla/test/cpp/cpp_test_util.h:12:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/computation_client.h:13:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/client/xla_computation.h:22:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/shape.h:25:
In file included from xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/layout.h:25:
xla/third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/util.h:568:12: error: no matching function for call to 'LsbMask'
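One way to read the log above: under C++14 through C++20 rules, a `constexpr` function body may not define a variable of a non-literal type, and here `DCHECK_GE` expands to a `CHECK_OP_LOG` loop that declares a `CheckOpString`. The sketch below reproduces that shape with hypothetical stand-ins (`NonLiteralCheck`, `LsbMaskSketch`, `CheckedLsbMaskSketch`); it is not the TensorFlow source or its eventual fix. It keeps the `constexpr` function free of non-literal locals and moves the check into a non-`constexpr` wrapper.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for a logging helper such as CheckOpString: the
// non-constexpr constructor and user-provided destructor make it non-literal.
struct NonLiteralCheck {
  explicit NonLiteralCheck(bool ok_in) : ok(ok_in) {}
  ~NonLiteralCheck() {}
  bool ok;
};

// Hypothetical low-bits mask. Assumes width <= number of bits in T.
// Defining `NonLiteralCheck check(width >= 0);` in this body is the kind of
// statement the compiler rejects in the log above.
template <typename T>
constexpr T LsbMaskSketch(int width) {
  static_assert(std::is_unsigned<T>::value, "T must be an unsigned type");
  return width <= 0
             ? T{0}
             : static_cast<T>(static_cast<T>(~T{0}) >>
                              (std::numeric_limits<T>::digits - width));
}

// The non-literal check is fine here because this wrapper is not constexpr.
template <typename T>
T CheckedLsbMaskSketch(int width) {
  NonLiteralCheck check(width >= 0);
  return check.ok ? LsbMaskSketch<T>(width) : T{0};
}

// Compile-time use still works for the constexpr version.
static_assert(LsbMaskSketch<std::uint64_t>(8) == 0xFFu, "low 8 bits set");
static_assert(LsbMaskSketch<std::uint64_t>(64) == ~std::uint64_t{0},
              "all 64 bits set");
```

Note that logging.h:472 in the log defines `DCHECK_GE` as `CHECK_GE`, which looks like the debug-build branch of the macro; that may be why build settings affect whether the error appears, though the log alone does not confirm it.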

miladm avatar May 19 '22 22:05 miladm

Just FYI, I've been observing the LsbMask error for a while and haven't been able to resolve it. FWIW, I noticed that another user has opened a related ticket in tensorflow -- https://github.com/tensorflow/tensorflow/issues/56021. On the issue, if you click expand, you'll see the error that this user is experiencing is:

In file included from ./tensorflow/compiler/xla/array.h:35:
./tensorflow/compiler/xla/util.h:568:12: error: no matching function for call to 'LsbMask'
    return LsbMask<uint64_t>(bits);
           ^~~~~~~~~~~~~~~~~
./tensorflow/compiler/xla/util.h:448:20: note: candidate template ignored: substitution failure [with T = unsigned long long]
constexpr inline T LsbMask(int width)
                   ^
3 errors generated.

wonjoolee95 avatar May 19 '22 22:05 wonjoolee95

LsbMask is the error I see locally

JackCaoG avatar May 19 '22 22:05 JackCaoG

Thanks @wonjoolee95. I included the tensorflow ticket as a blocker to this issue.

miladm avatar May 19 '22 22:05 miladm

The ticket remains open as the issue is blocked on the tensorflow ticket at the moment. Reopening.

miladm avatar May 20 '22 04:05 miladm

Got this error again. Is there any workaround?

XBWGC avatar Jun 06 '22 10:06 XBWGC

You can build with BUILD_CPP_TESTS=0 to get around this issue.

JackCaoG avatar Jun 06 '22 18:06 JackCaoG

@JackCaoG any more insight on this? opened a new issue https://github.com/tensorflow/tensorflow/issues/56430

ngam avatar Jun 13 '22 14:06 ngam

We don't run into the issue if we build the pt/xla tests on the newer docker images we published, but it still fails in my local environment. This seems to be a build system issue that is tricky to resolve.

JackCaoG avatar Jun 13 '22 17:06 JackCaoG

Interesting. I wonder if reverting the constexpr changes can fix this... Waiting for the tensorflow people to weigh in...

Thanks a lot. I made sure to link this issue in the new tf issue.

ngam avatar Jun 13 '22 18:06 ngam

Local build passed for me if I set DEBUG=0, which sets the flag in https://github.com/pytorch/xla/blob/master/setup.py#L330

JackCaoG avatar Jun 17 '22 21:06 JackCaoG

@JackCaoG looks like your setup worked on docker images, correct? FWIW, for me, the issue persists outside of a docker image when DEBUG=0.

miladm avatar Jun 17 '22 21:06 miladm

What compiler version are you both using?

According to the response from upstream, https://github.com/tensorflow/tensorflow/issues/56430#issuecomment-1161289033, it may be a version issue. I am trying that now in conda-forge and it seems to be working (so far). We used to get failures around 1 hour into the CI run, but it's been more than 2 hours now without failure...

edit: it is still failing, but still tinkering with versions to see if we could resolve it.

ngam avatar Jun 21 '22 18:06 ngam

FYI, resolved by https://github.com/tensorflow/tensorflow/commit/bc4521dd193290f86bd5de8a56cefbcbfeae3213

ngam avatar Jun 25 '22 02:06 ngam

Thanks @ngam for the heads up, I will try to test after we update our tf pins.

JackCaoG avatar Jun 27 '22 17:06 JackCaoG

Update: I am able to build the tests on a brand new docker image with DEBUG=0 on python=3.7.

miladm avatar Jul 04 '22 06:07 miladm