
Head is broken with `kl_div_backward`

Open JackCaoG opened this issue 2 years ago • 4 comments

🐛 Bug

To Reproduce

Build PyTorch and PyTorch/XLA both from head.

Update: the real error is the kl_div_backward failure; see https://github.com/pytorch/xla/issues/3682#issuecomment-1174627323


error message
~~Building torch_xla version: 1.13~~
~~XLA Commit ID: 6a03a5dcf6c0c057577a3b9742840040a030298a~~
~~PyTorch Commit ID: c78bdd7ba04ae514a46258083c67f165608f3d27~~
~~Traceback (most recent call last):~~
~~  File "/home/jackcao/pytorch/xla/scripts/gen_lazy_tensor.py", line 58, in <module>~~
~~    run_gen_lazy_tensor(~~
~~  File "/usr/local/lib/python3.8/dist-packages/torchgen/gen_lazy_tensor.py", line 302, in run_gen_lazy_tensor~~
~~    parsed_yaml = parse_native_yaml(native_yaml_path, tags_yaml_path)~~
~~  File "/usr/local/lib/python3.8/dist-packages/torchgen/gen.py", line 237, in parse_native_yaml~~
~~    _GLOBAL_PARSE_NATIVE_YAML_CACHE[path] = parse_native_yaml_struct(~~
~~  File "/usr/local/lib/python3.8/dist-packages/torchgen/gen.py", line 174, in parse_native_yaml_struct~~
~~    func, m = NativeFunction.from_yaml(e, loc, valid_tags, ignore_keys)~~
~~  File "/usr/local/lib/python3.8/dist-packages/torchgen/model.py", line 548, in from_yaml~~
~~    dispatch_key = DispatchKey.parse(k.strip())~~
~~  File "/usr/local/lib/python3.8/dist-packages/torchgen/model.py", line 142, in parse~~
~~    raise AssertionError(f"unknown dispatch key {value}")~~
~~AssertionError: unknown dispatch key CompositeExplicitAutogradNonFunctional~~
~~  in /home/jackcao/pytorch/aten/src/ATen/native/native_functions.yaml:761:~~
~~    as_strided_(Tensor(a!) self, int[] size, int[] stride, int? storage_offset=None) -> Tensor(a!)~~
~~Failed to generate lazy files: ['python', '/home/jackcao/pytorch/xla/scripts/gen_lazy_tensor.py']~~

JackCaoG avatar Jul 05 '22 04:07 JackCaoG

Hmm.. I checked native_functions.yaml and I think the real error comes from https://github.com/pytorch/pytorch/pull/80334. I need to verify this locally. Yesterday's wheel build succeeded, so I am sure this is due to a change merged into PyTorch today.

JackCaoG avatar Jul 05 '22 05:07 JackCaoG

Weird.. even if I sync PyTorch to a commit before https://github.com/pytorch/pytorch/pull/80334, I still see the build error. I don't see any change to as_strided, so this is confusing.

JackCaoG avatar Jul 05 '22 05:07 JackCaoG

I think the as_strided issue is something else; it only happens on the TPU VM. On CI and my other build machine I see:

(pytorch) root@dbc27111d843:/pytorch/xla# python setup.py install
Building torch_xla version: 1.13
XLA Commit ID: 6a03a5dcf6c0c057577a3b9742840040a030298a
PyTorch Commit ID: c78bdd7ba04ae514a46258083c67f165608f3d27
Traceback (most recent call last):
  File "/pytorch/xla/scripts/gen_lazy_tensor.py", line 83, in <module>
    get_device_fn="torch_xla::bridge::GetXlaDevice")
  File "/root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torchgen/gen_lazy_tensor.py", line 357, in run_gen_lazy_tensor
    source_yaml, grouped_native_functions, backend_indices
  File "/root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torchgen/gen_backend_stubs.py", line 149, in parse_backend_yaml
    use_device_guard=use_device_guard,
  File "/root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torchgen/gen_backend_stubs.py", line 122, in create_backend_index
    ), f"Found an invalid operator name: {op_name}"
AssertionError: Found an invalid operator name: kl_div_backward
Failed to generate lazy files: ['python', '/pytorch/xla/scripts/gen_lazy_tensor.py']

This is on head, and it should be fixed by https://github.com/pytorch/xla/pull/3683
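For readers unfamiliar with this assertion, here is a minimal sketch of the kind of check that produces "Found an invalid operator name". It is not torchgen's actual code: the function name `find_invalid_ops` and the sample op lists are hypothetical. It only illustrates the assumed idea that every op listed in the backend YAML must still be declared in native_functions.yaml, so an op removed or renamed upstream gets flagged.

```python
# Hypothetical, simplified stand-in for the validation that raised the error.
# This is NOT the torchgen implementation; names and data are made up.
def find_invalid_ops(backend_ops, native_ops):
    """Return backend op names that are not declared natively, in input order."""
    known = set(native_ops)
    return [name for name in backend_ops if name not in known]


# Made-up sample data: pretend upstream no longer declares kl_div_backward,
# while the backend's op list still mentions it.
native_ops = ["abs", "add.Tensor", "kl_div"]
backend_ops = ["abs", "kl_div_backward"]

invalid = find_invalid_ops(backend_ops, native_ops)
print(invalid)
```

Under that assumption, the fix is on the backend side: drop or update the stale entry so the two lists agree again.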

JackCaoG avatar Jul 05 '22 05:07 JackCaoG

Ah OK, I think the CompositeExplicitAutogradNonFunctional issue happens because I have torch 1.12 preinstalled on the TPU VM. The real build error is related to kl_div_backward.
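A quick way to check for this kind of stale install is to see which torchgen Python actually resolves and whether it knows the dispatch key. The sketch below is only an illustration: the helper name `torchgen_report` is hypothetical, and it assumes that `torchgen.model.DispatchKey` is an enum, as the traceback in the original report suggests. It returns `None` values when torchgen is not installed or cannot be imported.

```python
import importlib.util

KEY = "CompositeExplicitAutogradNonFunctional"


def torchgen_report(key=KEY):
    """Report where torchgen is loaded from and whether it defines `key`.

    Hypothetical helper for illustration; values are None when unknown.
    """
    spec = importlib.util.find_spec("torchgen")
    if spec is None:
        # torchgen is not installed in this environment.
        return {"location": None, "has_key": None}
    try:
        from torchgen.model import DispatchKey
        # Assumes DispatchKey is an Enum, so member names are in __members__.
        has_key = key in DispatchKey.__members__
    except Exception:
        # Import failed or the module layout differs from the assumption.
        has_key = None
    return {"location": spec.origin, "has_key": has_key}


print(torchgen_report())
```

If the reported location points at a system-wide site-packages or dist-packages directory rather than the source checkout, and `has_key` is `False`, an older preinstalled torch is likely shadowing the one built from head.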

JackCaoG avatar Jul 05 '22 05:07 JackCaoG