`test_variable_sequence_xla` fails upon updating sym_sizes for dynamic shape
🐛 Bug
After resolving some earlier issues via https://github.com/pytorch/xla/commit/7da8d3b47e09332ac43c4c09ac78883c676d7594, I ran into the following failures. It turns out that standalone runs of these tests pass, but when they run alongside other Python tests, we observe the failure.
@JackCaoG have you seen this pattern before, where tests pass independently but fail when run alongside other tests?
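One way to narrow down this kind of test-order dependence (a sketch; it assumes `unittest` accepts multiple test names on the command line, the same mechanism the single-test invocations below rely on) is to run the failing test together with one candidate preceding test at a time:

```shell
# Hypothetical bisection: pair the failing test with one earlier test to see
# whether that pairing alone reproduces the failure.
python ../test/test_nn.py -v \
    TestNNDeviceTypeXLA.test_upsamplingNearest3d_xla \
    TestNNDeviceTypeXLA.test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla
```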
Failing error:
test_upsamplingNearest3d_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact1d_correctness_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact1d_rescale_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact2d_correctness_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact3d_correctness_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_variable_sequence_xla (__main__.TestNNDeviceTypeXLA) ... skipped 'skipped on XLA'
======================================================================
ERROR: test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla (__main__.TestNNDeviceTypeXLA)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 390, in instantiated_test
raise rte
File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
result = test(self, **param_kwargs)
File "/workspace/pytorch/xla/test/../../test/test_nn.py", line 20124, in test_cross_entropy_label_smoothing_consistent_index_target_and_probs
output_with_index = loss(input, target)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1175, in forward
label_smoothing=self.label_smoothing)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 3020, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
======================================================================
ERROR: test_cross_entropy_label_smoothing_weight_ignore_indices_xla (__main__.TestNNDeviceTypeXLA)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 390, in instantiated_test
raise rte
File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
result = test(self, **param_kwargs)
File "/workspace/pytorch/xla/test/../../test/test_nn.py", line 20180, in test_cross_entropy_label_smoothing_weight_ignore_indices
check_equal(loss, (inp1, targ_default_ignore_index), (inp2, targ_default_ignore_index))
File "/workspace/pytorch/xla/test/../../test/test_nn.py", line 20172, in check_equal
l1 = loss(inp1, targ1)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1175, in forward
label_smoothing=self.label_smoothing)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 3020, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
----------------------------------------------------------------------
Ran 901 tests in 918.780s
Passing local tests:
$ python ../test/test_nn.py -v TestNNDeviceTypeXLA.test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla
2022-08-11 20:02:17.789034: W 3579590 tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-08-11 20:02:17.789093: W 3579590 tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla (__main__.TestNNDeviceTypeXLA) ... ok
----------------------------------------------------------------------
Ran 1 test in 27.517s
OK
$ python ../test/test_nn.py -v TestNNDeviceTypeXLA.test_cross_entropy_label_smoothing_weight_ignore_indices_xla
2022-08-11 20:03:06.514888: W 3581676 tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-08-11 20:03:06.514963: W 3581676 tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
test_cross_entropy_label_smoothing_weight_ignore_indices_xla (__main__.TestNNDeviceTypeXLA) ... ok
----------------------------------------------------------------------
Ran 1 test in 1.802s
OK
CC @Gamrix
This issue occurs because run_tests.sh executes the tests in experimental mode, which enables dynamic ops to run through PyTorch/XLA.
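For context, my understanding is that the dynamic-shape experimental mode is toggled via the `XLA_EXPERIMENTAL` environment variable; the exact value below is an assumption about what run_tests.sh sets, based on the dynamic ops flagged in the counter dump:

```shell
# Assumed: run_tests.sh sets something like this to route dynamic ops
# (nonzero, masked_select) through PyTorch/XLA's experimental path.
export XLA_EXPERIMENTAL="nonzero:masked_select"
python ../test/test_nn.py -v TestNNDeviceTypeXLA.test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla
```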
...
Counter: aten::_local_scalar_dense
Value: 60
Counter: aten::masked_select <--- dynamic op
Value: 10
Counter: aten::nonzero <--- dynamic op
Value: 30
Counter: xla::_copy_from
Value: 160
Counter: xla::_log_softmax
Value: 60
Counter: xla::_to_cpu
Value: 100
Counter: xla::add
Value: 30
Counter: xla::bitwise_not
Value: 40
Counter: xla::div
Value: 10
Counter: xla::empty
Value: 190
Counter: xla::eq
Value: 60
Counter: xla::fill_
Value: 30
Counter: xla::index_put_
Value: 30
Counter: xla::masked_fill_
Value: 60
Counter: xla::masked_select
Value: 10
Counter: xla::max
Value: 30
Counter: xla::mean
Value: 10
Counter: xla::min
Value: 30
Counter: xla::mul
Value: 90
Counter: xla::neg
Value: 60
Counter: xla::nll_loss2d_forward
Value: 24
Counter: xla::nll_loss_forward
Value: 6
Counter: xla::nonzero
Value: 30
Counter: xla::normal_
Value: 30
Counter: xla::permute
Value: 30
Counter: xla::random_
Value: 30
Counter: xla::scatter
Value: 30
Counter: xla::scatter_reduce_helper
Value: 30
Counter: xla::select
Value: 90
Counter: xla::sum
Value: 70
Counter: xla::unsqueeze
Value: 30
Counter: xla::view
Value: 42
Counter: xla::zero_
Value: 30
...
The aten::masked_select counter means masked_select fell back to CPU instead of executing on XLA; otherwise you would only see xla::masked_select. I think it executed this fallback path:
https://github.com/pytorch/xla/blob/3935e4445eba5af370ebc01b4daf5cec4c026900/torch_xla/csrc/aten_xla_type.cpp#L1724-L1728
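The aten::-vs-xla:: distinction can be checked mechanically. A minimal sketch that scans a metrics-report dump like the one above and collects the CPU-fallback counters (the line format is an assumption based on the pasted report; `fallback_counters` is a hypothetical helper, not a torch_xla API):

```python
# Sketch: flag CPU-fallback counters in a PyTorch/XLA metrics report.
# Counters prefixed "aten::" indicate ops that fell back to CPU;
# "xla::" counters went through the XLA lowering.
def fallback_counters(report: str) -> dict[str, int]:
    fallbacks = {}
    lines = report.splitlines()
    for i, line in enumerate(lines):
        line = line.strip()
        if line.startswith("Counter: aten::"):
            name = line.removeprefix("Counter: ")
            # The "Value: N" line immediately follows each "Counter:" line.
            value = int(lines[i + 1].strip().removeprefix("Value: "))
            fallbacks[name] = value
    return fallbacks

report = """\
Counter: aten::masked_select
  Value: 10
Counter: aten::nonzero
  Value: 30
Counter: xla::masked_select
  Value: 10
"""
print(fallback_counters(report))  # {'aten::masked_select': 10, 'aten::nonzero': 30}
```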