
`tensor.shape` is not yet implemented for dynamic tensors

Open miladm opened this issue 2 years ago • 6 comments

The following test fails on Dynamic Shape ops. This is mostly because XLASymIntNodeImpl doesn't support ToString()

>>> a2.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: RuntimeError: NYI

In the above test, a2 is a dynamic tensor.

CC @vanbasten23

miladm avatar Oct 03 '22 23:10 miladm
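The failure mode described above can be sketched with plain-Python stand-ins. These classes are purely illustrative (they are not the real SymIntNodeImpl or XLASymIntNodeImpl): the point is that printing a shape forces str() on every dimension, so a single symbolic dimension with an unimplemented str() makes the whole shape query fail.

```python
# Illustrative sketch only -- not the actual PyTorch/XLA classes.

class SymIntNode:
    """Stand-in for a symbolic dimension whose str() is not yet implemented."""
    def __str__(self):
        raise RuntimeError("NYI")

class Shape:
    """Stand-in for a tensor shape holding static ints and symbolic dims."""
    def __init__(self, dims):
        self.dims = dims
    def __str__(self):
        # Formatting the shape calls str() on every dimension,
        # including symbolic ones -- this is where the NYI surfaces.
        return "[" + ", ".join(str(d) for d in self.dims) + "]"

static_shape = Shape([1, 6])
dynamic_shape = Shape([SymIntNode(), 2])   # e.g. rows of a nonzero() result

print(static_shape)        # prints [1, 6]
try:
    print(dynamic_shape)   # raises RuntimeError: NYI
except RuntimeError as e:
    print("failed:", e)
```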

In response to the first comment: at https://github.com/pytorch/pytorch/blob/d39e9c1e9087069fa774b0e3eb47e04750edca88/c10/core/SymIntNodeImpl.h#L85, I changed the error to a more specific string, such as "str() NYI". Then I rebuilt pytorch and ran the commands:

>>> a1 = torch.tensor([[1,0,0,5,0,6]], device=dev)
>>> a2 = torch.nonzero(a1)
>>> a2.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: RuntimeError: NYI

That suggests implementing str() for XLASymIntNodeImpl may not be enough. Do you think we need to register the python dispatcher to the XLA key, as we discussed in today's meeting? @Krovatkin Also, Will Constable mentioned he would share instructions on how to register the python dispatcher to the XLA key. Do you know where I can find them?

vanbasten23 avatar Oct 05 '22 00:10 vanbasten23
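The idea behind "registering a Python handler for the XLA dispatch key" can be sketched as a toy handler table. This is NOT PyTorch's dispatcher API; it is only an illustration of why a per-backend registration lets the XLA backend supply its own str() implementation instead of falling through to a base-class NYI stub.

```python
# Toy dispatch registry -- illustrative only, not torch's dispatcher.

HANDLERS = {}

def register(key, op):
    """Register fn as the implementation of op for backend key."""
    def deco(fn):
        HANDLERS[(key, op)] = fn
        return fn
    return deco

def dispatch(key, op, *args):
    """Look up the (key, op) handler; fall through to an NYI error."""
    fn = HANDLERS.get((key, op))
    if fn is None:
        raise RuntimeError(f"{op} NYI for key {key}")
    return fn(*args)

@register("XLA", "sym_str")
def xla_sym_str(node):
    # The XLA backend's own string rendering of a symbolic int.
    return "XLASymIntNodeImpl"

print(dispatch("XLA", "sym_str", object()))  # prints XLASymIntNodeImpl
# dispatch("CPU", "sym_str", object()) would raise RuntimeError: ... NYI
```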

Okay, I got the c++ stacktrace:

(pytorch) root@t1v-n-cf794107-w-0:/# python3
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch, torch_xla, torch_xla.core.xla_model as xm
>>> dev = xm.xla_device()
>>> a1 = torch.tensor([[1,0,0,5,0,6]], device=dev)
>>> a2 = torch.nonzero(a1)
>>> a2.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: RuntimeError: NYI
Exception raised from str at /pytorch/c10/core/SymIntNodeImpl.h:83 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7f27d86ea23d in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xdd (0x7f27d86e895d in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: torch_xla::XLATensor::~XLATensor() + 0 (0x7f27d0920950 in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/_XLAC.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x89002a (0x7f27e3c1302a in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x23436c (0x7f27e35b736c in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #11: <unknown function> + 0x761490 (0x7f27e3ae4490 in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x78e0e6 (0x7f27e3b110e6 in /root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #29: __libc_start_main + 0xeb (0x7f27ed76f09b in /lib/x86_64-linux-gnu/libc.so.6)

>>>

I'm confused. In my local /pytorch/c10/core/SymIntNodeImpl.h, I changed virtual std::string str() to

virtual std::string str() {
    TORCH_CHECK(false, "C10_API SymIntNodeImpl str() is NYI");
  };

Then I built it via python setup.py install under pytorch/, went back to HOME, and ran the above python commands. Shouldn't it pick up my local change?

vanbasten23 avatar Oct 05 '22 23:10 vanbasten23

@vanbasten23 @miladm

I'm not quite seeing what you guys both are seeing in https://github.com/pytorch/xla/pull/4073

the test included in the PR prints :

__str__ BEGIN
__str__ END
XLASymIntNodeImpl

after printing shape

which seems to indicate that we are indeed hitting the XLASymIntNodeImpl implementation?

I'm using the following configuration options:

export XLA_EXPERIMENTAL="nonzero:masked_select"
export XRT_WORKERS="localservice:0;grpc://localhost:40934"
export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"

Krovatkin avatar Oct 06 '22 18:10 Krovatkin

@vanbasten23 @miladm Added an example of printing a static value of a DimensionNode here: https://github.com/pytorch/xla/pull/4073/commits/1ff4bae6b9fb32ecbf4f72a05dd4e513f0a68e60

This is the output:

(pytorch) root@9471890a681a:/home/pytorch/xla/test# python test_str.py
__str__ BEGIN
__str__ END
IR=SizeNode, static=6
after printing shape

Krovatkin avatar Oct 06 '22 18:10 Krovatkin
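The "IR=SizeNode, static=6" output above suggests a dimension node that carries a static upper bound even when the runtime value is dynamic. A hedged sketch (class and field names are illustrative, not the actual torch_xla DimensionNode API):

```python
# Illustrative stand-in for a dimension node with a static upper bound.

class SizeNode:
    def __init__(self, static_bound):
        # nonzero() on a 6-element row can return at most 6 rows, so
        # the static (upper-bound) size is 6 even though the true
        # runtime count is dynamic.
        self.static_bound = static_bound

    def __str__(self):
        return f"IR=SizeNode, static={self.static_bound}"

print(SizeNode(6))   # prints IR=SizeNode, static=6
```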

> , then built it via python setup.py install under pytorch/, then went back to HOME, run the above python command. Wouldn't it pick up my local change?

This workflow works perfectly for me. I actually added

__str__ BEGIN
__str__ END

to the pytorch bindings in jit/python/init.cpp and rebuilt pytorch with python setup.py install. For this kind of change I didn't need to rebuild xla, but to be on the safe side you could rebuild XLA as well.

I have a few theories:

  • when you ran python setup.py install, it didn't build successfully?
  • python setup.py install was accidentally run in the wrong folder
  • for some reason python setup.py install didn't overwrite the existing binary package. You could try pip uninstall torch first and then python setup.py install. You could also turn your changes in pytorch into a commit (git add -u; git commit -m "XX"); then, when you load pytorch, print torch.version.git_version and double-check that it matches your commit. This way you can be sure you are using the right version of pytorch.

Krovatkin avatar Oct 06 '22 18:10 Krovatkin
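A quick way to check which on-disk copy of a package Python actually imports, which helps debug the "my rebuild isn't being picked up" confusion above. The stdlib json module is used as a stand-in here; for the thread's case you would import torch and inspect torch.__file__ (and torch.version.git_version) instead.

```python
import importlib

# Import a module and print the file it was actually loaded from.
# If this path points at an old site-packages install rather than
# your freshly built tree, your rebuild is not being picked up.
mod = importlib.import_module("json")   # stand-in for "torch"
print(mod.__file__)                     # path of the resolved copy
```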

Thanks for confirming @Krovatkin.

@vanbasten23 will push a PR to support str() here.

miladm avatar Oct 06 '22 21:10 miladm

The bug is fixed. Closing.

miladm avatar Oct 14 '22 21:10 miladm