Sam Ade Jacobs
Also, I suggest a comparison of existing prototext with PFE to verify correctness.
@Quentin-Anthony , I had no issue saving checkpoints with Megatron-DeepSpeed training of the GPT-350M model with both ZeRO-1 and ZeRO-3. Additional configurations of interest are as follows: 8 V100...
Ulysses is, in principle, attention-type agnostic. Although we haven’t specifically tested Ulysses with Ring Attention, as long as the query, key, and value (qkv) tensors can be split or sharded along the sequence and head dimensions,...
@Momo-Tori , yes, Ulysses is a form of TP in the sense that the attention block is head-parallel. In general, Ulysses is sequence parallelism plus head parallelism. It starts out...
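To make the sequence-parallelism-plus-head-parallelism idea concrete, here is a minimal numpy sketch of the resharding step: before attention each rank holds a slice of the sequence dimension for all heads, and an all-to-all exchange (simulated below with plain slicing and concatenation, not `torch.distributed.all_to_all`) gives each rank the full sequence for a subset of heads. The tensor sizes and rank count are illustrative, not from the discussion above.

```python
import numpy as np

P = 4                        # number of sequence-parallel ranks (illustrative)
SEQ, HEADS, DIM = 8, 4, 2    # toy sizes; SEQ and HEADS divisible by P

# Full activation tensor that would exist without any parallelism.
full = np.arange(SEQ * HEADS * DIM, dtype=np.float32).reshape(SEQ, HEADS, DIM)

# Before attention: each rank holds a contiguous slice of the sequence dim.
seq_shards = [full[r * (SEQ // P):(r + 1) * (SEQ // P)] for r in range(P)]

def all_to_all_seq_to_head(shards, P):
    """Simulate the all-to-all that turns sequence sharding into head sharding.

    Each rank slices its local sequence shard into P head groups and sends
    group h to rank h; every rank then concatenates the P pieces it receives
    along the sequence dimension, ending up with the full sequence for its
    own subset of heads.
    """
    hp = shards[0].shape[1] // P  # heads per rank after the exchange
    out = []
    for h in range(P):            # receiving rank h
        pieces = [shards[s][:, h * hp:(h + 1) * hp, :] for s in range(P)]
        out.append(np.concatenate(pieces, axis=0))
    return out

head_shards = all_to_all_seq_to_head(seq_shards, P)

# Each rank now sees the full sequence for its heads, so any attention
# kernel (dense, sparse, flash, ...) can run locally and unchanged.
assert head_shards[0].shape == (SEQ, HEADS // P, DIM)
assert np.array_equal(head_shards[1], full[:, 1:2, :])
```

Because each rank ends up with complete sequences for whole heads, the attention computation itself needs no modification, which is why Ulysses stays attention-type agnostic.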
@Kwen-Chen, your input data processing looks good to me. As for your second and third questions, you need a sequence-parallel-aware loss calculation ([see example here](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/core/sequence_parallel/cross_entropy.py)).
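The core idea of a sequence-parallel-aware loss: each rank computes the loss sum and token count over its own sequence shard only, and an all-reduce combines them before taking the global mean. A minimal numpy sketch of that idea follows, with the all-reduce simulated by a plain sum; it is not the linked Megatron-DeepSpeed kernel, and all sizes are made up for illustration.

```python
import numpy as np

def local_token_nll(logits, labels):
    """Negative log-likelihood summed over a rank's local sequence shard."""
    # Log-softmax computed in a numerically stable way.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -logprobs[np.arange(len(labels)), labels].sum(), len(labels)

# Toy data: a full sequence of 8 tokens, vocab of 5, split across 2 ranks.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
labels = rng.integers(0, 5, size=8)

# Each rank holds a contiguous slice of the sequence dimension.
shards = [(logits[:4], labels[:4]), (logits[4:], labels[4:])]

# Each rank computes its local loss sum and token count ...
partials = [local_token_nll(lg, lb) for lg, lb in shards]

# ... then an all-reduce (simulated here by summing over ranks) combines
# them; the global mean divides by the global token count, not the local one.
total_loss = sum(p[0] for p in partials)
total_tokens = sum(p[1] for p in partials)
mean_loss = total_loss / total_tokens

# Sanity check against the unsharded computation.
full_loss, full_tokens = local_token_nll(logits, labels)
assert np.isclose(mean_loss, full_loss / full_tokens)
```

Averaging each rank's local mean instead of reducing sums and counts would give the wrong answer whenever shards have unequal token counts (e.g. with padding or label masking), which is the usual bug this pattern avoids.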
@LoggerHead22, we will look into this issue. As a stopgap, please consider using the [hpZ component of ZeRO++](https://www.deepspeed.ai/tutorials/zeropp/).
We recommend that you use [DeepSpeed universal checkpoint](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing).
@lleizuo , could you please provide additional details (e.g., model and training hyperparams) to reproduce this issue?
Hi @noob-ctrl, do you have a repro?
Closing, please re-open with a repro if needed.