Does SPMD support expert parallelism?

Open mars1248 opened this issue 1 year ago • 5 comments

Does torch-xla SPMD support expert parallelism? For an MoE model, how should the computation be expressed in XLA?

❓ Questions and Help

mars1248, May 13 '24 03:05

We are actively working on an MoE distributed training example. Maybe @alanwaketan can share more details.

JackCaoG, May 13 '24 17:05

Yea, will let you know once we have more information.

alanwaketan, May 13 '24 19:05

> Yea, will let you know once we have more information.

@alanwaketan Can you share a bit about your approach? My plan is to express the experts as parallel computation in SPMD, and then add custom calls to handle routing a variable number of tokens to each expert.

mars1248, May 14 '24 01:05
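
For readers following the routing question above: one common way to avoid variable-length token groups under a static-shape compiler is to give every expert a fixed capacity, pad unused slots, and drop overflow tokens. The sketch below is plain Python and only illustrates that idea. It is not torch-xla code and not confirmed as the approach the maintainers are taking; the names `dispatch_fixed_capacity`, `combine`, and `PAD`, and the toy "experts" that just scale their input, are hypothetical.

```python
from typing import List, Tuple

PAD = -1  # hypothetical marker for an unused slot in an expert's buffer


def dispatch_fixed_capacity(
    expert_ids: List[int], num_experts: int, capacity: int
) -> Tuple[List[List[int]], List[int]]:
    """Assign token indices to experts using fixed-size buffers.

    Returns (slots, dropped). slots[e] always has length `capacity` and
    holds token indices padded with PAD, so its shape does not depend on
    the routing decisions. dropped lists tokens that overflowed.
    """
    slots = [[PAD] * capacity for _ in range(num_experts)]
    fill = [0] * num_experts  # number of slots used so far per expert
    dropped: List[int] = []
    for token, expert in enumerate(expert_ids):
        if fill[expert] < capacity:
            slots[expert][fill[expert]] = token
            fill[expert] += 1
        else:
            dropped.append(token)  # expert is full, token is skipped
    return slots, dropped


def combine(
    slots: List[List[int]],
    expert_outputs: List[List[float]],
    num_tokens: int,
    fill_value: float = 0.0,
) -> List[float]:
    """Scatter per-slot expert outputs back into original token order."""
    out = [fill_value] * num_tokens
    for e, row in enumerate(slots):
        for c, token in enumerate(row):
            if token != PAD:
                out[token] = expert_outputs[e][c]
    return out


# Top-1 routing decisions for 6 tokens over 2 experts (made-up values).
expert_ids = [0, 1, 0, 0, 1, 0]
tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

slots, dropped = dispatch_fixed_capacity(expert_ids, num_experts=2, capacity=3)

# Toy "experts": expert e multiplies its input by (e + 2); padded slots give 0.
expert_outputs = [
    [tokens[t] * (e + 2) if t != PAD else 0.0 for t in row]
    for e, row in enumerate(slots)
]
result = combine(slots, expert_outputs, len(tokens))

print(slots)    # [[0, 2, 3], [1, 4, -1]]
print(dropped)  # [5]
print(result)   # [2.0, 6.0, 6.0, 8.0, 15.0, 0.0]
```

Because every expert buffer has the same fixed length, the leading expert dimension is the natural one to shard across devices. How that sharding and any custom call would be written in torch-xla is left open here, as it is in the thread.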

@alanwaketan Do you have any updates on this issue?

ysiraichi, Apr 17 '25 13:04

Any updates on this issue? We're trying to use torch-xla SPMD mode to run inference on gpt-oss and any help (whether that's example code, or just documentation) would be greatly appreciated.

hshahTT, Sep 03 '25 20:09