lingvo
Using DistributedShampoo gives unexpected errors
My system configuration
- Docker image: tensorflow/tensorflow:latest-gpu-py3
- Ubuntu 18.04
- GCC 7.5.0
- TensorFlow 2.1.0
- CUDA 10.1
- Python 3.6
- Lingvo: commit 8eb904be3bc47e541f24a64a1a86751808b1bfe8 (Apr 23, 2020)
Problem
I installed lingvo with ‘pip install lingvo’, but the DistributedShampoo optimizer is missing from the installed package. As a workaround, I did the following:
- copied everything under lingvo/core/ from the master branch to the “python3.6/dist-packages/lingvo” directory
- compiled lingvo/core/x_ops with bazel, and copied libx_ops.so to “python3.6/dist-packages/lingvo”
- ran distributed_shampoo_test.py:
python3 distributed_shampoo_test.py
and got the errors below. I am not sure I am using it correctly; any advice would be helpful!
'2020-04-24 07:29:30.282871: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-24 07:29:30.282882: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-24 07:29:30.282926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-24 07:29:30.282937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-24 07:29:30.282946: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-24 07:29:30.282955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-24 07:29:30.282964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-24 07:29:30.283045: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.283916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.284720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-24 07:29:30.284769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-24 07:29:30.284776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-24 07:29:30.284781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-24 07:29:30.284938: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.285787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:29:30.286625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6487 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
[ OK ] TensorPartitionerTest.testTensorPartitioner
[ RUN ] TensorPartitionerTest.test_session
[ SKIPPED ] TensorPartitionerTest.test_session
ERROR: testShampooWithMatrixShapedTensorsWithBlocks (__main__.DistributedShampooTest)
testShampooWithMatrixShapedTensorsWithBlocks (__main__.DistributedShampooTest)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [2,2] vs. [4,4]
  [[{{node add}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "distributed_shampoo_test.py", line 308, in testShampooWithMatrixShapedTensorsWithBlocks
    assign_preconditioners_to_vars_op.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 2391, in run
    _run_using_default_session(self, feed_dict, self.graph, session)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 5347, in _run_using_default_session
    session.run(operation, feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [2,2] vs. [4,4]
  [[node add (defined at /usr/local/lib/python3.6/dist-packages/lingvo/core/distributed_shampoo.py:444) ]]
Errors may have originated from an input operation. Input Source operations connected to node add: mul_1 (defined at /usr/local/lib/python3.6/dist-packages/lingvo/core/distributed_shampoo.py:443)
Original stack trace for 'add':
File "distributed_shampoo_test.py", line 411, in
======================================================================
FAIL: testShampooWithMatrixShapedTensors (__main__.DistributedShampooTest)
testShampooWithMatrixShapedTensors (__main__.DistributedShampooTest)
Traceback (most recent call last):
  File "distributed_shampoo_test.py", line 102, in testShampooWithMatrixShapedTensors
    self.assertAllCloseAccordingToType(init_var_np, var_step_0_val, atol=1e-1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2543, in assertAllCloseAccordingToType
    self.assertAllClose(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2495, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2462, in _assertAllCloseRecursive
    (path_str, path_str, msg)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2397, in _assertArrayLikeAllClose
    a, b, rtol=rtol, atol=atol, err_msg="\n".join(msgs), equal_nan=True)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError: Not equal to tolerance rtol=1e-06, atol=0.1
Mismatched value: a is different from b.
not close where = (array([0, 0, 1, 1, 2, 2, 3, 3]), array([0, 1, 0, 1, 0, 1, 0, 1]))
not close lhs = [0. 0. 0. 0. 0. 0. 0. 0.]
not close rhs = [-1. -1. -1. -1. -0.99999994 -1. -1. -1. ]
not close dif = [1. 1. 1. 1. 0.99999994 1. 1. 1. ]
not close tol = [0.100001 0.100001 0.100001 0.100001 0.100001 0.100001 0.100001 0.100001]
dtype = float64, shape = (4, 2)
Mismatched elements: 8 / 8 (100%)
Max absolute difference: 1.
Max relative difference: 1.
x: array([[0., 0.], [0., 0.], [0., 0.], [0., 0.]])
y: array([[-1., -1.], [-1., -1.], [-1., -1.], [-1., -1.]], dtype=float32)
======================================================================
FAIL: testShampooWithMatrixShapedTensorsRightOnlyPreconditioner (__main__.DistributedShampooTest)
testShampooWithMatrixShapedTensorsRightOnlyPreconditioner (__main__.DistributedShampooTest)
Traceback (most recent call last):
  File "distributed_shampoo_test.py", line 195, in testShampooWithMatrixShapedTensorsRightOnlyPreconditioner
    mat_right, expected_mat_right, atol=1e-1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2543, in assertAllCloseAccordingToType
    self.assertAllClose(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1153, in decorated
    return f(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2495, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2462, in _assertAllCloseRecursive
    (path_str, path_str, msg)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 2397, in _assertArrayLikeAllClose
    a, b, rtol=rtol, atol=atol, err_msg="\n".join(msgs), equal_nan=True)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError: Not equal to tolerance rtol=1e-06, atol=0.1
Mismatched value: a is different from b.
not close where = (array([0, 0, 1, 1]), array([0, 1, 0, 1]))
not close lhs = [ 1.49313118 -0.33912626 -0.33912626 1.32734667]
not close rhs = [ 1.2134336 -0.14391004 -0.14391004 1.143082 ]
not close dif = [0.27969755 0.19521622 0.19521622 0.18426464]
not close tol = [0.10000122 0.10000014 0.10000014 0.10000114]
dtype = float64, shape = (2, 2)
Mismatched elements: 4 / 4 (100%)
Max absolute difference: 0.27969755
Max relative difference: 1.35651568
x: array([[ 1.493131, -0.339126], [-0.339126, 1.327347]])
y: array([[ 1.213434, -0.14391 ], [-0.14391 , 1.143082]], dtype=float32)
Ran 6 tests in 5.434s
FAILED (failures=2, errors=1, skipped=2)'
I updated the pip package. Please give it a try.
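After upgrading, it may help to confirm that the installed package actually ships the optimizer before rerunning the tests. The sketch below is only an illustration: the helper name has_module is hypothetical, and the module path lingvo.core.distributed_shampoo is assumed from the file path shown in the traceback above.

```python
import importlib.util


def has_module(dotted_name):
    """Return True if `dotted_name` can be located in the current environment.

    Note: for a dotted name, find_spec imports the parent packages
    (here `lingvo` and `lingvo.core`) but not the target module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `lingvo` itself) is not installed.
        return False


if __name__ == "__main__":
    # Assumed module path, taken from the traceback in this thread.
    name = "lingvo.core.distributed_shampoo"
    print(name, "found" if has_module(name) else "missing")
```

If this reports the module as missing after an upgrade, the environment is probably still picking up the older install, which is worth ruling out before copying files into dist-packages by hand.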
@jonathanasdf That works, thanks very much!!