soft-moe-pytorch
Not receiving grads with cpu_besides?
Hi Phil,
I have been working with @tomaarsen of HF and @haileyschoelkopf of EAI on testing soft MoE.
One issue that came up was that the tensors being gathered were not contiguous:
gathered_tensors = all_gather_same_dim(t)
return AllGatherFunction.apply(x, self.dim, sizes)
x, batch_sizes = all_gather_variable_dim(x, dim = dim, sizes = sizes)
gathered_tensors = all_gather_same_dim(t)
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be contiguous
Adding .contiguous() to t in all_gather_same_dim() seemed to resolve this issue:
def all_gather_same_dim(t):
world_size = dist.get_world_size()
gathered_tensors = [torch.empty_like(t.contiguous(), device = t.device, dtype = t.dtype) for i in range(world_size)]
dist.all_gather(gathered_tensors, t.contiguous())
return gathered_tensors
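For reference, the contiguity requirement is easy to see in isolation; below is a minimal torch-only sketch of our own (not repo code). A transposed view shares storage with the original tensor and is not dense in memory, which is what dist.all_gather rejects:
import torch
# minimal sketch (ours, not repo code): a transposed view is non-contiguous,
# which is why dist.all_gather raises "Tensors must be contiguous" without .contiguous()
t = torch.randn(4, 8).t()
print(t.is_contiguous())               # False
print(t.contiguous().is_contiguous())  # True, .contiguous() copies the data into dense memory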
But then another issue presented itself, where some parameter indices were not receiving grads during the backward pass:
Parameter indices which did not receive grad for rank 1: 8 9 10 11 12 13 14 15 30 31 32 33 34 35 36 37 52 53 54 55 56 57 58 59 74 75 76 77 78 79 80 81
Parameter indices which did not receive grad for rank 0: 16 17 18 19 20 21 22 23 38 39 40 41 42 43 44 45 60 61 62 63 64 65 66 67 82 83 84 85 86 87 88 89
We traced this back to self.all_experts_to_cpu_besides(expert_slice):
# get the experts in use
self.all_experts_to_cpu_besides(expert_slice)
experts = self.experts[expert_slice]
By commenting out self.all_experts_to_cpu_besides(expert_slice), the script would then run and the loss would decrease seemingly normally with AMP.
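To show the symptom in isolation, here is a toy example of our own (not the repo's code): only the sliced experts take part in the loss, so the remaining experts' parameters never receive a grad, which is what DDP's reducer then complains about.
import torch
import torch.nn as nn
# toy sketch (ours, not repo code): only the sliced experts contribute to the loss,
# so the other experts' parameters end up with grad == None after backward
experts = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
expert_slice = slice(0, 2)
x = torch.randn(4, 8)
loss = sum(expert(x).sum() for expert in experts[expert_slice])
loss.backward()
for i, expert in enumerate(experts):
    print(i, expert.weight.grad is not None)  # True for experts 0 and 1, False for 2 and 3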
Do you have any idea why the above issue would occur or how it should properly be resolved?
Always greatly appreciate your help.
Thank you,
Enrico
@conceptofmind thanks for pointing out the need for a contiguous
for the latter issue, i don't really know off the top of my head, not without spending some time debugging
is this on one machine? maybe the logic for determining which expert is activated can be moved to init and the device can be set before forward? would welcome a PR if you figure it out, now that mixture of experts is becoming a hot thing
@conceptofmind i thought you were working on startups with Aran? lol
that's what Aran told me the last time i chatted with him
regardless, it is cool you are all working on MoE! it needs more love in the open source space
@lucidrains Regarding 8c3fedbb92e9c98ed6bd6e80a797fa4c2f14b32c, t also needs to be made contiguous, e.g. note the second-to-last line:
def all_gather_same_dim(t):
world_size = dist.get_world_size()
gathered_tensors = [torch.empty_like(t.contiguous(), device = t.device, dtype = t.dtype) for i in range(world_size)]
dist.all_gather(gathered_tensors, t.contiguous())
return gathered_tensors
@tomaarsen oops, fixed
Hi Phil,
Thank you for the response.
We are happy to continue diagnosing the issue and open a PR. We can also provide some DDP training code later.
This is currently one machine with 8 GPUs, run using torchrun --nnodes=1 --nproc_per_node=8 script.py.
I should also clarify that, in addition to commenting out self.all_experts_to_cpu_besides(expert_slice), setting find_unused_parameters=True was required in the DDP wrapper:
self.model = DDP(
self.model,
device_ids=[self.local_rank],
output_device=self.local_rank,
find_unused_parameters=True,
gradient_as_bucket_view=True
)
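For context, here is a rough sketch of the surrounding setup that this wrapper assumes when launched with torchrun; the LOCAL_RANK handling and the placeholder model are ours, not code from the repo or our script.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# rough sketch of the setup assumed above, launched via
# torchrun --nnodes=1 --nproc_per_node=8 script.py
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
model = torch.nn.Linear(512, 512).cuda(local_rank)  # placeholder for the actual soft-MoE model
model = DDP(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    find_unused_parameters=True,
    gradient_as_bucket_view=True
)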
If find_unused_parameters is not set to True and self.all_experts_to_cpu_besides(expert_slice) is left in, this error about device placement occurs:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cpu!
exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and CPU!
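That optimizer error can be reproduced in isolation with a toy parameter (our own sketch, not repo code; it needs a GPU, and the exact message can vary by PyTorch version). Adam creates its exp_avg state on the parameter's original device at the first step, so a parameter that later ends up on the CPU mixes devices on the next step:
import torch
# toy sketch (ours, not repo code; requires a GPU)
param = torch.nn.Parameter(torch.randn(8, 8, device="cuda"))
opt = torch.optim.Adam([param])
param.grad = torch.randn_like(param)
opt.step()                      # exp_avg / exp_avg_sq are created on cuda here
param.data = param.data.cpu()   # mimic an expert being offloaded to cpu after stepping
param.grad = torch.randn(8, 8)  # its next grad then also lives on cpu
opt.step()                      # RuntimeError: Expected all tensors to be on the same device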
We checked whether the DDP model was on CUDA, and it appeared to be:
cuda:0
cuda:1
cuda:7
cuda:6
cuda:4
cuda:2
cuda:3
cuda:5
RE: i thought you were working on startups with Aran?
Aran and I did co-found a startup together! We will hopefully have some interesting things related to OSS MoE out in the near future!
We have been trying to organize research efforts across OSS organizations such as EAI, HF, LAION, SAI, etc., as we feel this type of collaboration will lead to much more fruitful results.
yup, makes sense re: ddp settings
a PR would be a great contribution! there's really only a handful of people who are working on MoE
ok cool, looking forward to seeing what you and Aran end up building!
are you aware that soft moe will not work in LLMs? also have this built https://github.com/lucidrains/st-moe-pytorch
Yes, we are not looking to apply this for LLMs :)
Hopefully we can provide more information to satisfy that curiosity soon :smile:
We can send an email or update here with some preliminary results.
We are aware that this type of MoE is incompatible with LLMs, but we think it can be applied to some other interesting use cases.
Going to test out st-moe as well!
@conceptofmind yes indeed, i've seen some papers using mixture of experts in text-to-image models with great success
nice, as long as you are aware!
Hi Phil,
We definitely appreciate you bringing attention to this. I imagine it will save someone confusion regarding soft-moe and LLMs if they read this issue in the future!
RE: if you all ever get stuck, if you can get me ssh access to one machine with multiple GPUs, i can put in some hours into debugging too
Absolutely! If we do get stuck I can definitely get you access to one machine or multiple machines through LAION/SAI or with one of our grants from Modal/Latitude. Currently talking to Huggingface about compute allocations as well.
Thank you,
Enrico