Issues and pull requests by Rui (7 results)

In this PR, we add support for a new initialization path that enables multi-node SPMD on Neuron. To keep the change minimal, we retain the `xla.init()` API, but...
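For context, a minimal sketch of the single-host SPMD setup that torch_xla already exposes; the multi-node Neuron path and the `xla.init()` entry point this PR modifies are not reproduced here, so treat this as background only:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# Enable SPMD mode before any device data is created.
xr.use_spmd()

# Build a 1-D mesh over all addressable devices.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ("data",))

# Shard the leading dimension of a tensor across the mesh.
t = torch.randn(8, 128).to(xm.xla_device())
xs.mark_sharding(t, mesh, ("data", None))
```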

Simple CR to avoid a segmentation fault when placeholder tensors are involved and we attempt to dereference the device from the buffer. It fixes the segfault reported in https://github.com/pytorch/xla/issues/9049...
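The actual fix lives in the runtime's buffer handling, but the shape of the guard is simple: a placeholder tensor has no backing buffer, so any code that reads the device off the buffer must check for that case first. A hypothetical Python rendering of the pattern (all names invented for illustration):

```python
def device_of(tensor):
    # Hypothetical attribute: placeholder tensors carry no materialized buffer.
    buffer = getattr(tensor, "buffer", None)
    if buffer is None:
        # Placeholder case: bail out instead of dereferencing a null handle,
        # which is what previously caused the segmentation fault.
        return None
    return buffer.device
```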

## 🐛 Bug

Test:

```python
def test_sharded_matmul(tensor_a_shape, tensor_b_shape, mesh_shape,
                        sharding_spec_a, sharding_spec_b):
  cpu_device = torch.device("cpu")
  neuron_device = xm.xla_device()
  device_ids = np.array(range(NUM_DEVICES))
  mesh = Mesh(device_ids, mesh_shape, ("tp1", "tp2"))
  tensor_a_cpu = torch.rand(tensor_a_shape, dtype=torch.float32, ...
```

bug
torchxla2
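The excerpt above truncates mid-call. For illustration only, here is a self-contained sketch of how such a sharded-matmul check typically finishes; the default shapes, sharding specs, device count, and tolerances are assumptions, not taken from the issue:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.spmd as xs
from torch_xla.distributed.spmd import Mesh

def sharded_matmul_check(tensor_a_shape=(16, 32), tensor_b_shape=(32, 8),
                         mesh_shape=(2, 2), sharding_spec_a=("tp1", "tp2"),
                         sharding_spec_b=("tp2", None), num_devices=4):
  # CPU reference result.
  tensor_a_cpu = torch.rand(tensor_a_shape, dtype=torch.float32)
  tensor_b_cpu = torch.rand(tensor_b_shape, dtype=torch.float32)
  expected = torch.matmul(tensor_a_cpu, tensor_b_cpu)

  # Shard both operands over a 2-D ("tp1", "tp2") mesh on the XLA device.
  mesh = Mesh(np.array(range(num_devices)), mesh_shape, ("tp1", "tp2"))
  tensor_a = tensor_a_cpu.to(xm.xla_device())
  tensor_b = tensor_b_cpu.to(xm.xla_device())
  xs.mark_sharding(tensor_a, mesh, sharding_spec_a)
  xs.mark_sharding(tensor_b, mesh, sharding_spec_b)

  # The sharded result should match the CPU reference.
  actual = torch.matmul(tensor_a, tensor_b)
  torch.testing.assert_close(actual.cpu(), expected, rtol=1e-3, atol=1e-3)
```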

## 🚀 Feature

We propose an accelerator-agnostic, hybrid Single-Program Multiple-Data (SPMD) / Multiple-Program Multiple-Data (MPMD) pipeline parallelism implementation in PyTorch XLA. The key objectives are:

* Enable efficient model-parallel training for large...

enhancement
distributed
RFC
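To make the MPMD half of the proposal concrete, here is a toy microbatched two-stage pipeline in plain PyTorch; it runs the stages sequentially on one host, so it illustrates only the scheduling idea, not the RFC's actual hybrid SPMD/MPMD design, and every name in it is invented for the example:

```python
import torch
import torch.nn as nn

# Two pipeline stages; in a real MPMD deployment each stage would run its own
# program on a separate device group, with activations sent between groups.
stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
stage1 = nn.Linear(64, 10)

def pipeline_forward(x, num_microbatches=4):
  # GPipe-style microbatching: split the batch and feed chunks through the
  # stages. Here the loop is sequential; a real pipeline overlaps stage0 on
  # microbatch i+1 with stage1 on microbatch i to keep all devices busy.
  outputs = []
  for microbatch in x.chunk(num_microbatches):
    outputs.append(stage1(stage0(microbatch)))
  return torch.cat(outputs)

y = pipeline_forward(torch.randn(16, 32))
```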

This PR extends the placeholder feature (https://github.com/pytorch/xla/issues/8612) to accommodate sharded tensors for SPMD. It also fixes a typo in the existing binding for...

We have only built Docker images for Python 3.10 since PyTorch/XLA 2.1. This has limited our ability to seamlessly debug and test changes for any given Python version, particularly for...

enhancement
install