Vitaliy Chiley
`attn_impl: torch | flash | triton` handle numerical precision differently. `attn_impl: torch`, which operates under the `with torch.autocast(**kwargs)` context manager, should be the most numerically stable; the other attn_impl are a...
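For a rough sense of the difference, here is an illustrative sketch (not llm-foundry's actual attention code; assumes a CUDA device): under autocast the matmuls run in the autocast dtype while softmax stays in fp32, whereas a fused fp16 path computes everything in its own fixed low precision.

```python
import torch

# Illustrative only: compare autocast-managed precision vs. a fully fp16 path.
def naive_attn(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(1, 8, 128, 64, device='cuda') for _ in range(3))

with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    out_autocast = naive_attn(q, k, v)               # matmuls in bf16, softmax in fp32

out_fp16 = naive_attn(q.half(), k.half(), v.half())  # everything in fp16
print((out_autocast.float() - out_fp16.float()).abs().max())
```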
I don't think we've implemented anything like that, but you'd probably implement this as [a composer callback](https://docs.mosaicml.com/projects/composer/en/latest/trainer/callbacks.html) which is triggered on the relevant event, depending on exactly what you want to implement and...
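A minimal sketch of what such a callback could look like (hypothetical class and metric names; the event method you override depends on what you want to implement):

```python
from composer.core import Callback, State
from composer.loggers import Logger

class MyCheck(Callback):
    """Hypothetical callback: runs at the end of every batch."""

    def batch_end(self, state: State, logger: Logger) -> None:
        # Inspect state.model / state.optimizers here and log whatever you need.
        logger.log_metrics({'my_check/batch': state.timestamp.batch.value})
```

You'd then pass an instance to the `Trainer` via `callbacks=[MyCheck()]`.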
Sorry, I have not seen this error before and do not know how to help. Are you still having this issue?
> Model parameters and optimizer states for those params should be saved against the same parameter name.

To verify expected behavior, can you, using pure python / pytorch, instantiate a...
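For example, something along these lines (a toy model, not your actual setup; the point is just to compare the names in `model.state_dict()` against what the optimizer saves):

```python
import torch

# Minimal pure-pytorch check: map optimizer state entries back to parameter names.
model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(2, 4)).sum().backward()
opt.step()

print(list(model.state_dict().keys()))  # e.g. ['weight', 'bias']

# The optimizer references params by index into its param groups by default,
# so build an index -> name map and check that the pairing is what you expect.
name_by_index = dict(enumerate(n for n, _ in model.named_parameters()))
for idx, entry in opt.state_dict()['state'].items():
    print(name_by_index[idx], list(entry.keys()))
```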
> I also found a few issues in some other changes where I am still investigating whether they have any influence on existing features.

Any updates on this?
> For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

We're talking about...
> `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`

Is leaving `num_global_experts` as a buffer the only issue with the PR? We could remove it from this PR and open a follow-up PR.
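For context, a minimal illustration of why the buffer question matters (toy modules, not the tutel code): registering it as a buffer puts it in the state dict and therefore in checkpoints, while a plain attribute does not.

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.register_buffer('num_global_experts', torch.tensor(n))  # checkpointed

class WithAttr(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.num_global_experts = n  # plain python attribute, not checkpointed

print(WithBuffer(8).state_dict().keys())  # odict_keys(['num_global_experts'])
print(WithAttr(8).state_dict().keys())    # odict_keys([])
```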
All tensor naming is still the same. If `bias=True`, the generated state dict will be the same. I only made layer init more conventional (along with using a more conventional...
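As a quick sanity check (an illustrative `nn.Linear`, not the actual layer in the PR), changing how the weights are initialized does not change the state-dict key names, and `bias=True` keeps the `bias` entry:

```python
import torch

layer = torch.nn.Linear(8, 8, bias=True)
torch.nn.init.trunc_normal_(layer.weight, std=0.02)  # a more conventional init
print(list(layer.state_dict().keys()))               # ['weight', 'bias'] either way
```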
In general, it seems as though if the input is not explicitly cast [here](https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L232) (i.e. we comment out those lines) and the input to the MoE layer is in fp16,...
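As a hypothetical illustration of the kind of mismatch in question (the names here are not the actual tutel code):

```python
import torch

expert = torch.nn.Linear(16, 16)               # fp32 expert weights
x = torch.randn(4, 16, dtype=torch.float16)    # fp16 input to the layer

try:
    expert(x)                                  # dtype mismatch without the explicit cast
except RuntimeError as e:
    print(e)

expert(x.to(expert.weight.dtype))              # works once the input is cast
```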
@xwyzsn ninja was removed, then torch was removed, then ninja was re-added. The next logical step is to re-add torch, right??? 😄