Vitaliy Chiley
`attn_impl: torch | flash | triton` handle numerical precision differently. `attn_impl: torch`, which operates under the `with torch.autocast(**kwargs)` context manager, should be the most numerically stable; the other attn_impl are a...
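For a rough sense of the difference, here is an illustrative sketch (not llm-foundry's actual attention code; assumes a CUDA device): under autocast the matmuls run in the autocast dtype while softmax stays in fp32, whereas a fused fp16 path computes everything in its own fixed low precision.

```python
import torch

# Illustrative only: compare autocast-managed precision vs. a fully fp16 path.
def naive_attn(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(1, 8, 128, 64, device='cuda') for _ in range(3))

with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    out_autocast = naive_attn(q, k, v)               # matmuls in bf16, softmax in fp32

out_fp16 = naive_attn(q.half(), k.half(), v.half())  # everything in fp16
print((out_autocast.float() - out_fp16.float()).abs().max())
```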
I don't think we've implemented anything like that, but you'd probably implement this as [a composer callback](https://docs.mosaicml.com/projects/composer/en/latest/trainer/callbacks.html) which is triggered on the relevant event, depending on exactly what you want to implement and...
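A minimal sketch of what such a callback could look like (hypothetical class and metric names; the event method you override depends on what you want to implement):

```python
from composer.core import Callback, State
from composer.loggers import Logger

class MyCheck(Callback):
    """Hypothetical callback: runs at the end of every batch."""

    def batch_end(self, state: State, logger: Logger) -> None:
        # Inspect state.model / state.optimizers here and log whatever you need.
        logger.log_metrics({'my_check/batch': state.timestamp.batch.value})
```

You'd then pass an instance to the `Trainer` via `callbacks=[MyCheck()]`.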
Sorry, I have not seen this error before and do not know how to help. Are you still having this issue?
> Model parameters and optimizer states for those params should be saved against the same parameter name.

To verify expected behavior, can you, using pure python / pytorch, instantiate a...
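For example, something along these lines (a toy model, not your actual setup; the point is just to compare the names in `model.state_dict()` against what the optimizer saves):

```python
import torch

# Minimal pure-pytorch check: map optimizer state entries back to parameter names.
model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(2, 4)).sum().backward()
opt.step()

print(list(model.state_dict().keys()))  # e.g. ['weight', 'bias']

# The optimizer references params by index into its param groups by default,
# so build an index -> name map and check that the pairing is what you expect.
name_by_index = dict(enumerate(n for n, _ in model.named_parameters()))
for idx, entry in opt.state_dict()['state'].items():
    print(name_by_index[idx], list(entry.keys()))
```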
> I also found a few issues in some other changes where I am still investigating whether they have any influence on existing features.

Any updates on this?
> For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

We're talking about...
> `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`

Is leaving `num_global_experts` as a buffer the only issue with the PR? We could remove it from this PR and open a follow-up PR.
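For context, a minimal illustration of why the buffer question matters (toy modules, not the tutel code): registering it as a buffer puts it in the state dict and therefore in checkpoints, while a plain attribute does not.

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.register_buffer('num_global_experts', torch.tensor(n))  # checkpointed

class WithAttr(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.num_global_experts = n  # plain python attribute, not checkpointed

print(WithBuffer(8).state_dict().keys())  # odict_keys(['num_global_experts'])
print(WithAttr(8).state_dict().keys())    # odict_keys([])
```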
All tensor naming is still the same. If `bias=True`, the generated state dict will be the same. I only made layer init more conventional (along with using a more conventional...
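As a quick sanity check (an illustrative `nn.Linear`, not the actual layer in the PR), changing how the weights are initialized does not change the state-dict key names, and `bias=True` keeps the `bias` entry:

```python
import torch

layer = torch.nn.Linear(8, 8, bias=True)
torch.nn.init.trunc_normal_(layer.weight, std=0.02)  # a more conventional init
print(list(layer.state_dict().keys()))               # ['weight', 'bias'] either way
```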
In general, it seems as though if the input is not explicitly cast [here](https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L232) (i.e. we comment out those lines) and the input to the MoE layer is in fp16,...
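As a hypothetical illustration of the kind of mismatch in question (the names here are not the actual tutel code):

```python
import torch

expert = torch.nn.Linear(16, 16)               # fp32 expert weights
x = torch.randn(4, 16, dtype=torch.float16)    # fp16 input to the layer

try:
    expert(x)                                  # dtype mismatch without the explicit cast
except RuntimeError as e:
    print(e)

expert(x.to(expert.weight.dtype))              # works once the input is cast
```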
@xwyzsn ninja was removed, then torch was removed, then ninja was re-added. The next logical step is to re-add torch, right??? 😄