
Using DistributedShampoo gives unexpected errors

Open howellyoung-s opened this issue 4 years ago • 2 comments

My system configuration:

  - Docker image: tensorflow/tensorflow:latest-gpu-py3
  - Ubuntu 18.04
  - GCC 7.5.0
  - TensorFlow 2.1.0
  - CUDA 10.1
  - Python 3.6
  - Lingvo: 8eb904be3bc47e541f24a64a1a86751808b1bfe8 (commits on Apr 23, 2020)

Problem: I installed lingvo with ‘pip install lingvo’, but the DistributedShampoo optimizer is missing from the package. As a workaround, I did the following:

  1. Copied everything under lingvo/core/ from the master branch into the “python3.6/dist-packages/lingvo” directory.
  2. Compiled lingvo/core/x_ops with bazel and copied libx_ops.so to “python3.6/dist-packages/lingvo”.
  3. Ran the test with ‘python3 distributed_shampoo_test.py’ and got the errors below.

I am not sure I am using it correctly; any advice would be helpful!
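Before copying files by hand, it can help to confirm whether the installed package really lacks the module. The following is a minimal sketch, not part of lingvo: `module_available` is a hypothetical helper, and the module name `lingvo.core.distributed_shampoo` is assumed from the file path lingvo/core/distributed_shampoo.py.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if `dotted_name` can be located by the import system.

    Hypothetical helper for illustration; not part of lingvo.
    Note that find_spec imports the parent packages of a dotted name,
    but not the target module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (for example `lingvo` itself)
        # is missing or fails to import.
        return False


if __name__ == "__main__":
    # Module name assumed from the path lingvo/core/distributed_shampoo.py.
    name = "lingvo.core.distributed_shampoo"
    print(name, "available:", module_available(name))
```

If this reports that the module is unavailable, the installed pip package most likely predates the optimizer, and upgrading the package is a cleaner route than mixing files from the master branch into an older install.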

'2020-04-24 07:29:30.282871: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-24 07:29:30.282882: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-24 07:29:30.282926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-24 07:29:30.282937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-24 07:29:30.282946: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-24 07:29:30.282955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-24 07:29:30.282964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-24 07:29:30.283045: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.283916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.284720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-24 07:29:30.284769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-24 07:29:30.284776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-24 07:29:30.284781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-24 07:29:30.284938: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.285787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.286625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6487 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
[ OK ] TensorPartitionerTest.testTensorPartitioner
[ RUN ] TensorPartitionerTest.test_session
[ SKIPPED ] TensorPartitionerTest.test_session

ERROR: testShampooWithMatrixShapedTensorsWithBlocks (__main__.DistributedShampooTest)
testShampooWithMatrixShapedTensorsWithBlocks (__main__.DistributedShampooTest)

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [2,2] vs. [4,4]
  [[{{node add}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "distributed_shampoo_test.py", line 308, in testShampooWithMatrixShapedTensorsWithBlocks
    assign_preconditioners_to_vars_op.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 2391, in run
    _run_using_default_session(self, feed_dict, self.graph, session)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 5347, in _run_using_default_session
    session.run(operation, feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [2,2] vs. [4,4]
  [[node add (defined at /usr/local/lib/python3.6/dist-packages/lingvo/core/distributed_shampoo.py:444) ]]

Errors may have originated from an input operation.
Input Source operations connected to node add:
  mul_1 (defined at /usr/local/lib/python3.6/dist-packages/lingvo/core/distributed_shampoo.py:443)

Original stack trace for 'add':
  File "distributed_shampoo_test.py", line 411, in <module>
    tf.test.main()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/googletest.py", line 65, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/benchmark.py", line 463, in benchmarks_main
    true_main()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/googletest.py", line 64, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/googletest.py", line 55, in g_main
    absltest_main(argv=argv)
  File "/usr/local/lib/python3.6/dist-packages/absl/testing/absltest.py", line 1948, in main
    _run_in_app(run_tests, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/absl/testing/absltest.py", line 2055, in _run_in_app
    function(argv, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/absl/testing/absltest.py", line 2334, in run_tests
    argv, args, kwargs, xml_reporter.TextAndXMLTestRunner)
  File "/usr/local/lib/python3.6/dist-packages/absl/testing/absltest.py", line 2304, in _run_and_get_tests_result
    test_program = unittest.TestProgram(*args, **kwargs)
  File "/usr/lib/python3.6/unittest/main.py", line 95, in __init__
    self.runTests()
  File "/usr/lib/python3.6/unittest/main.py", line 256, in runTests
    self.result = testRunner.run(self.test)
  File "/usr/local/lib/python3.6/dist-packages/absl/testing/_pretty_print_reporter.py", line 87, in run
    return super(TextTestRunner, self).run(test)
  File "/usr/lib/python3.6/unittest/runner.py", line 176, in run
    test(result)
  File "/usr/lib/python3.6/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.6/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.6/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.6/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.6/unittest/case.py", line 653, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.6/unittest/case.py", line 605, in run
    testMethod()
  File "distributed_shampoo_test.py", line 258, in testShampooWithMatrixShapedTensorsWithBlocks
    opt.assign_preconditioner_to_host_vars())
  File "/usr/local/lib/python3.6/dist-packages/lingvo/core/distributed_shampoo.py", line 444, in assign_preconditioner_to_host_vars
    success_mult * preconditioner_val))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 902, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 1194, in _add_dispatch
    return gen_math_ops.add_v2(x, y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 483, in add_v2
    "AddV2", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
    self._traceback = tf_stack.extract_stack()

======================================================================
FAIL: testShampooWithMatrixShapedTensors (__main__.DistributedShampooTest)
testShampooWithMatrixShapedTensors (__main__.DistributedShampooTest)

Traceback (most recent call last):
  File "distributed_shampoo_test.py", line 102, in testShampooWithMatrixShapedTensors
    self.assertAllCloseAccordingToType(init_var_np, var_step_0_val, atol=1e-1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2543, in assertAllCloseAccordingToType
    self.assertAllClose(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2495, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2462, in _assertAllCloseRecursive
    (path_str, path_str, msg)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2397, in _assertArrayLikeAllClose
    a, b, rtol=rtol, atol=atol, err_msg="\n".join(msgs), equal_nan=True)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=0.1
Mismatched value: a is different from b.
not close where = (array([0, 0, 1, 1, 2, 2, 3, 3]), array([0, 1, 0, 1, 0, 1, 0, 1]))
not close lhs = [0. 0. 0. 0. 0. 0. 0. 0.]
not close rhs = [-1. -1. -1. -1. -0.99999994 -1. -1. -1. ]
not close dif = [1. 1. 1. 1. 0.99999994 1. 1. 1.]
not close tol = [0.100001 0.100001 0.100001 0.100001 0.100001 0.100001 0.100001 0.100001]
dtype = float64, shape = (4, 2)
Mismatched elements: 8 / 8 (100%)
Max absolute difference: 1.
Max relative difference: 1.
 x: array([[0., 0.], [0., 0.], [0., 0.], [0., 0.]])
 y: array([[-1., -1.], [-1., -1.], [-1., -1.], [-1., -1.]], dtype=float32)

======================================================================
FAIL: testShampooWithMatrixShapedTensorsRightOnlyPreconditioner (__main__.DistributedShampooTest)
testShampooWithMatrixShapedTensorsRightOnlyPreconditioner (__main__.DistributedShampooTest)

Traceback (most recent call last):
  File "distributed_shampoo_test.py", line 195, in testShampooWithMatrixShapedTensorsRightOnlyPreconditioner
    mat_right, expected_mat_right, atol=1e-1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2543, in assertAllCloseAccordingToType
    self.assertAllClose(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2495, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2462, in _assertAllCloseRecursive
    (path_str, path_str, msg)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2397, in _assertArrayLikeAllClose
    a, b, rtol=rtol, atol=atol, err_msg="\n".join(msgs), equal_nan=True)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=0.1
Mismatched value: a is different from b.
not close where = (array([0, 0, 1, 1]), array([0, 1, 0, 1]))
not close lhs = [ 1.49313118 -0.33912626 -0.33912626 1.32734667]
not close rhs = [ 1.2134336 -0.14391004 -0.14391004 1.143082 ]
not close dif = [0.27969755 0.19521622 0.19521622 0.18426464]
not close tol = [0.10000122 0.10000014 0.10000014 0.10000114]
dtype = float64, shape = (2, 2)
Mismatched elements: 4 / 4 (100%)
Max absolute difference: 0.27969755
Max relative difference: 1.35651568
 x: array([[ 1.493131, -0.339126], [-0.339126, 1.327347]])
 y: array([[ 1.213434, -0.14391 ], [-0.14391 , 1.143082]], dtype=float32)


Ran 6 tests in 5.434s

FAILED (failures=2, errors=1, skipped=2)'
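Two kinds of failure appear in the log above: an elementwise add whose operands have shapes [2,2] and [4,4], and assertions where values differ by more than the allowed tolerance. The sketch below restates both checks in plain Python, without TensorFlow or NumPy, to show how to read those messages. The helper names are hypothetical, and the sketch does not attempt to explain why the optimizer produced these values.

```python
from itertools import zip_longest


def broadcast_compatible(shape_a, shape_b):
    """Check NumPy-style broadcasting between two shapes.

    Dimensions are compared from the right; each pair must be equal
    or contain a 1. Missing leading dimensions are treated as 1.
    """
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    return all(a == b or a == 1 or b == 1 for a, b in pairs)


def within_tolerance(actual, desired, rtol, atol):
    """Closeness test of the form |actual - desired| <= atol + rtol * |desired|."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


if __name__ == "__main__":
    # Shapes taken from "Incompatible shapes: [2,2] vs. [4,4]".
    print(broadcast_compatible((2, 2), (4, 4)))
    # First mismatched element of testShampooWithMatrixShapedTensors: 0.0 vs -1.0.
    print(within_tolerance(0.0, -1.0, rtol=1e-6, atol=0.1))
```

Under these rules a [2,2] operand cannot be added to a [4,4] operand, and a difference of 1.0 exceeds the allowed 0.1 + 1e-6 × 1.0, which is about the 0.100001 that the log reports as "not close tol".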

howellyoung-s, Apr 24 '20 11:04

I updated the pip package. Please give it a try.

jonathanasdf, Apr 25 '20 03:04

@jonathanasdf That works, thanks very much!!

howellyoung-s, Apr 27 '20 04:04