Yanming W.
@codeislife99 FYI, previous thread about the same issue: https://github.com/pytorch/xla/issues/3347. It looks like PyTorch now requires `CUDA_VISIBLE_DEVICES` to be set before importing torch.
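For anyone hitting this, a minimal sketch of the ordering that works (the device string is illustrative):

```python
import os

# Must be set before the first `import torch`, since PyTorch reads it at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # hypothetical: expose only GPU 0

import torch  # imported only after the env var is set

print(torch.cuda.device_count())  # reflects the visibility set above
```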
@codeislife99 It looks like this is a bug in PyTorch 1.12 that has since been fixed: https://github.com/pytorch/pytorch/issues/80876. We need to figure out how to not rely on `CUDA_VISIBLE_DEVICES` to set...
Maybe some environment issue? I couldn't reproduce it in the CI docker image `gcr.io/tpu-pytorch/xla_base:latest-d8db50a778a39fab0a58436307a3225a6ca06f67` with `pip install git+https://github.com/huggingface/transformers@06a6a4b`.
It looks like TensorFlow Grappler has trouble creating clusters, and the error comes from [here](https://github.com/tensorflow/tensorflow/blob/e9db4aec6714173c1e556b701feda06cc5203380/tensorflow/core/grappler/clusters/virtual_cluster.cc#L50). This step should happen during compilation, so maybe XLA can skip these optimizations if...
I think this op may not need to be implemented as an XLA custom call. I found it can be lowered using broadcast + diff + reduce. You can check out...
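I don't have the exact op handy here, but as an illustration of the broadcast + diff + reduce pattern, a cdist-style pairwise squared distance can be expressed like this (a sketch, not the actual lowering):

```python
import torch

def pairwise_sq_dist(x, y):
    # broadcast: (n, 1, d) - (1, m, d) -> (n, m, d)
    diff = x.unsqueeze(1) - y.unsqueeze(0)
    # reduce over the feature dimension -> (n, m)
    return (diff * diff).sum(-1)

x = torch.randn(4, 3)
y = torch.randn(5, 3)
# agrees with the squared p=2 distance from torch.cdist
assert torch.allclose(pairwise_sq_dist(x, y), torch.cdist(x, y).pow(2), atol=1e-4)
```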
To fix this error, I simply reverted commit 0de6ecda97e261528b51709c11a4e7e22a39ca33. I suspect this is an XRT-related issue and is not specific to any platform.
I've created a prototype using an XLA custom call and the scipy.optimize.linear_sum_assignment C++ API: https://github.com/ymwangg/xla/commit/0bba425041ba3e664c682ebb3b430a846949b2ff. This implementation requires round-trip data transfers to the CPU to do the computation. Ideally we want everything on...
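For context, the round trip in pure Python looks roughly like this (a sketch of the pattern only; the actual prototype wires SciPy's C++ solver into an XLA custom call):

```python
import torch
from scipy.optimize import linear_sum_assignment

def lsa_via_cpu(cost: torch.Tensor):
    """Solve the assignment problem by round-tripping through host memory."""
    # SciPy's solver only accepts host arrays, so copy device -> CPU first.
    cost_cpu = cost.detach().cpu().numpy()
    row_ind, col_ind = linear_sum_assignment(cost_cpu)
    # Copy the resulting index arrays back to the original device.
    return (torch.from_numpy(row_ind).to(cost.device),
            torch.from_numpy(col_ind).to(cost.device))
```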
Yes, it defaults to 8 now: https://github.com/pytorch/xla/blob/32da64f69a9c246186603a168291d7e42e0d3884/torch_xla/__init__.py#L51-L52.
Updating your PyTorch version should solve this problem. `StorageImpl::mutable_data()` was added recently: https://github.com/pytorch/pytorch/pull/97647.
Hi @skrider, thanks for the great work! Based on my tests, this kernel is 1.5-4x faster than the Triton equivalent. But when I use it for end-to-end testing in vLLM,...