
`test_variable_sequence_xla` fails upon updating sym_sizes for dynamic shape

Open miladm opened this issue 3 years ago • 3 comments

🐛 Bug

After resolving some earlier issues via https://github.com/pytorch/xla/commit/7da8d3b47e09332ac43c4c09ac78883c676d7594, I ran into the following failures. It turns out that these tests pass when run standalone, but fail when run alongside the other Python tests.

@JackCaoG have you seen this pattern where the tests would pass independently but not when run along with other tests?

Failing Error:

test_upsamplingNearest3d_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact1d_correctness_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact1d_rescale_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact2d_correctness_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_upsamplingNearestExact3d_correctness_xla (__main__.TestNNDeviceTypeXLA) ... ok
test_variable_sequence_xla (__main__.TestNNDeviceTypeXLA) ... skipped 'skipped on XLA'

======================================================================
ERROR: test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla (__main__.TestNNDeviceTypeXLA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 390, in instantiated_test
    raise rte
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/workspace/pytorch/xla/test/../../test/test_nn.py", line 20124, in test_cross_entropy_label_smoothing_consistent_index_target_and_probs
    output_with_index = loss(input, target)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1175, in forward
    label_smoothing=self.label_smoothing)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 3020, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

======================================================================
ERROR: test_cross_entropy_label_smoothing_weight_ignore_indices_xla (__main__.TestNNDeviceTypeXLA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 390, in instantiated_test
    raise rte
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/workspace/pytorch/xla/test/../../test/test_nn.py", line 20180, in test_cross_entropy_label_smoothing_weight_ignore_indices
    check_equal(loss, (inp1, targ_default_ignore_index), (inp2, targ_default_ignore_index))
  File "/workspace/pytorch/xla/test/../../test/test_nn.py", line 20172, in check_equal
    l1 = loss(inp1, targ1)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1175, in forward
    label_smoothing=self.label_smoothing)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 3020, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

----------------------------------------------------------------------
Ran 901 tests in 918.780s

Passing local tests:

$ python ../test/test_nn.py -v TestNNDeviceTypeXLA.test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla

2022-08-11 20:02:17.789034: W 3579590 tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-08-11 20:02:17.789093: W 3579590 tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
test_cross_entropy_label_smoothing_consistent_index_target_and_probs_xla (__main__.TestNNDeviceTypeXLA) ... ok

----------------------------------------------------------------------
Ran 1 test in 27.517s

OK
$ python ../test/test_nn.py -v TestNNDeviceTypeXLA.test_cross_entropy_label_smoothing_weight_ignore_indices_xla

2022-08-11 20:03:06.514888: W 3581676 tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-08-11 20:03:06.514963: W 3581676 tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
test_cross_entropy_label_smoothing_weight_ignore_indices_xla (__main__.TestNNDeviceTypeXLA) ... ok

----------------------------------------------------------------------
Ran 1 test in 1.802s

OK

miladm avatar Aug 11 '22 20:08 miladm

CC @Gamrix

miladm avatar Aug 11 '22 20:08 miladm

This issue occurs because run_tests.sh executes the tests in experimental mode, meaning it enables dynamic ops to run through PyTorch/XLA.

...
Counter: aten::_local_scalar_dense
  Value: 60
Counter: aten::masked_select             <--- dynamic op
  Value: 10
Counter: aten::nonzero                  <--- dynamic op
  Value: 30
Counter: xla::_copy_from
  Value: 160
Counter: xla::_log_softmax
  Value: 60
Counter: xla::_to_cpu
  Value: 100
Counter: xla::add
  Value: 30
Counter: xla::bitwise_not
  Value: 40
Counter: xla::div
  Value: 10
Counter: xla::empty
  Value: 190
Counter: xla::eq
  Value: 60
Counter: xla::fill_
  Value: 30
Counter: xla::index_put_
  Value: 30
Counter: xla::masked_fill_
  Value: 60
Counter: xla::masked_select
  Value: 10
Counter: xla::max
  Value: 30
Counter: xla::mean
  Value: 10
Counter: xla::min
  Value: 30
Counter: xla::mul
  Value: 90
Counter: xla::neg
  Value: 60
Counter: xla::nll_loss2d_forward
  Value: 24
Counter: xla::nll_loss_forward
  Value: 6
Counter: xla::nonzero
  Value: 30
Counter: xla::normal_
  Value: 30
Counter: xla::permute
  Value: 30
Counter: xla::random_
  Value: 30
Counter: xla::scatter
  Value: 30
Counter: xla::scatter_reduce_helper
  Value: 30
Counter: xla::select
  Value: 90
Counter: xla::sum
  Value: 70
Counter: xla::unsqueeze
  Value: 30
Counter: xla::view
  Value: 42
Counter: xla::zero_
  Value: 30
...

miladm avatar Aug 14 '22 07:08 miladm

The `aten::masked_select` counter actually means `masked_select` fell back to CPU instead of executing on XLA; otherwise you would only see `xla::masked_select`. I think it executed https://github.com/pytorch/xla/blob/3935e4445eba5af370ebc01b4daf5cec4c026900/torch_xla/csrc/aten_xla_type.cpp#L1724-L1728
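This distinction can be checked mechanically from a metrics report like the one pasted above: any counter prefixed `aten::` is an op that fell back to CPU, while `xla::` counters ran through XLA. A minimal sketch (a hypothetical helper, not part of PyTorch/XLA) that extracts the fallback counters from the report text:

```python
def find_cpu_fallbacks(report: str) -> dict:
    """Return {counter_name: value} for every aten:: counter in a metrics report."""
    fallbacks = {}
    lines = report.splitlines()
    for i, line in enumerate(lines):
        line = line.strip()
        if line.startswith("Counter: aten::"):
            # split()[0] drops trailing annotations like "<--- dynamic op"
            name = line[len("Counter: "):].split()[0]
            # The value appears on the following line, e.g. "  Value: 10"
            value = int(lines[i + 1].strip().split()[-1])
            fallbacks[name] = value
    return fallbacks

report = """\
Counter: aten::masked_select
  Value: 10
Counter: aten::nonzero
  Value: 30
Counter: xla::masked_select
  Value: 10
"""
print(find_cpu_fallbacks(report))
# {'aten::masked_select': 10, 'aten::nonzero': 30}
```

Applied to the full report above, this would flag `aten::_local_scalar_dense`, `aten::masked_select`, and `aten::nonzero` as CPU fallbacks.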

JackCaoG avatar Aug 14 '22 07:08 JackCaoG