2.4 backport PR request list
This issue tracks backports for the 2.4 release.
For any PRs you want to backport to 2.4, please reply with the following:
- Original PR link
- Reason to backport
- 2.4 backport PR link
- Original PR: https://github.com/pytorch/xla/pull/7157
- Reason to backport: Urgent fix for while_loop, a planned release feature, in the 2.4 release
- 2.4 backport PR link: https://github.com/pytorch/xla/pull/7306
- Original PR: https://github.com/pytorch/xla/pull/7155
- Reason to backport: Fix a bug in distributed checkpointing with padded tensors
- 2.4 backport PR link: https://github.com/pytorch/xla/pull/7243
- Original PR: #7236
- Reason to backport: Include build dependencies for CUDA 12.4
- 2.4 backport PR link: #7244
- Original PR: #7219
- Reason to backport: Minor addition to Triton functionality to support CUDA plugins
- 2.4 backport PR link: #7303
- Original PRs: https://github.com/pytorch/xla/pull/7254, https://github.com/pytorch/xla/pull/7258
- Reason to backport: OpenXLA pin update to enable new telemetry in PyTorch/XLA; requested by Aman
- 2.4 backport PR links: https://github.com/pytorch/xla/pull/7261, https://github.com/pytorch/xla/pull/7262
- Original PRs: #7249 and #7268
- Reason to backport: Enables new libtpu telemetry and detection of the CUDA plugin package (torch_xla_cuda_plugin) by default.
- Backport link: #7270
- Original PR: Enable bucketized all-reduce for gradients #7216
- Reason to backport: Parity with Neuron branch r2.1_aws_neuron
- Backport link: WIP
- Original PR: https://github.com/pytorch/xla/pull/7278
- Backport PR: https://github.com/pytorch/xla/pull/7284
- Reason: This fixes a regression. Without it, PyTorch/XLA always emits log spam about the PJRT version on load.
- Risk: Low.
- Original PR: https://github.com/pytorch/xla/pull/7271
- Backport PR: https://github.com/pytorch/xla/pull/7285
- Reason: This is a regression fix. Without it, several xlml tests will fail.
- Risk: Low, since it is just a revert.
- Original PR: https://github.com/pytorch/xla/pull/7301
- Backport PR: https://github.com/pytorch/xla/pull/7310
- Reason: It's a cherry-picked hotfix for `AttributeError: module 'numpy' has no attribute 'product'` in the r2.4 CI.
- Risk: Low, since the issue on master has been fixed by the original PR.
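For context, a minimal sketch of the failure and the fix, assuming the CI picked up NumPy 2.0 (where the long-deprecated `np.product` alias was removed, which matches the error above):

```python
import numpy as np

shape = (2, 3, 4)

# Under NumPy 2.0 the removed alias raises the error seen in CI:
#   np.product(shape)  # AttributeError: module 'numpy' has no attribute 'product'

# The supported spelling computes the same value:
size = np.prod(shape)
print(size)  # 24
```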
Original PRs:
- https://github.com/pytorch/xla/pull/7234
- https://github.com/pytorch/xla/pull/7246
- https://github.com/pytorch/xla/pull/7256
- https://github.com/pytorch/xla/pull/7322
- https://github.com/pytorch/xla/pull/7327
- https://github.com/pytorch/xla/pull/7328
- https://github.com/pytorch/xla/pull/7341
- https://github.com/pytorch/xla/pull/7631
Backport PRs:
- https://github.com/pytorch/xla/pull/7611
- https://github.com/pytorch/xla/pull/7643
- https://github.com/pytorch/xla/pull/7649
- https://github.com/pytorch/xla/pull/7666
- https://github.com/pytorch/xla/pull/7668
- https://github.com/pytorch/xla/pull/7669
- https://github.com/pytorch/xla/pull/7673
Reason: I want to enable eager mode for the 2.4 release since I want to make it the default in 2.6.
Risk: Low; all of the features are guarded by torch_xla.experimental.eager_mode, so these PRs should be no-ops if eager mode is disabled.
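To illustrate the guard, here is a minimal sketch of opting into eager mode (the exact call sites in the backported PRs may differ; `torch_xla.experimental.eager_mode` is the flag named in the Risk note above):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

# Eager mode is opt-in in 2.4; left at its default (off), the
# backported changes are expected to be no-ops.
torch_xla.experimental.eager_mode(True)

device = xm.xla_device()

# With eager mode enabled, each op executes immediately instead of
# being traced into a graph and compiled at a step boundary.
a = torch.randn(2, 2, device=device)
b = a @ a
print(b)
```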
Original PRs:
- #7617
- #7329
Backport PRs:
- #7616
- #7618
Reason: To fix CI.
Risk: Low, since this doesn't change anything in the torch_xla library.
Original PR: #7640
Backport PR: #7684
Reason: To fix the upstream PyTorch build.
Risk: Low; this doesn't change anything in the torch_xla library.
Original: https://github.com/pytorch/xla/pull/7231
Backport: https://github.com/pytorch/xla/pull/7708
Reason: Support MoE (Mixture of Experts).
Original PRs:
- https://github.com/pytorch/xla/pull/7700
- https://github.com/pytorch/xla/pull/7710
Backport PR:
- https://github.com/pytorch/xla/pull/7717
Reason: Backport the docs for eager mode.
Risk: Low; it's just a docs change.