Binxuan Huang comments

Results: 4 comments by Binxuan Huang

MOE training Loss inconsistent after resume from old checkpoint

Hi @fanshiqing, if we use the legacy checkpointing method instead of distributed checkpointing, will we still encounter this issue?

Training Script

> We've managed to train mamba by modifying the Huggingface Trainer class. Here is our [implementation](https://github.com/havenhq/mamba-chat/tree/main), we were actually able to train a chat model that seems to perform quite...

bfloat16 overflow during training session

I am using PyTorch's FSDP with bf16 for training. It looks like I encountered a similar issue with NaN loss.

Add Logits to OpenAI ChatCompletions model

Could we set `logprobs` to a large number for the vLLM and OpenAI completion APIs so that we can do the multiple-choice task using one-token generation?
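
To make the idea concrete, here is a rough sketch of how one-token multiple-choice scoring could work once the per-token log-probabilities are available. It is only an illustration: `pick_choice` and the `top_logprobs` dict are hypothetical names, and the dict is assumed to map candidate first tokens to log-probabilities as returned for a single generated token. No real API is called here.

```python
import math


def pick_choice(top_logprobs, choices=("A", "B", "C", "D")):
    """Pick the answer label whose token has the highest log-probability.

    top_logprobs: assumed dict of {token: logprob} for the first generated
    token (hypothetical input shape, not tied to a specific API response).
    Returns (best_label, scores); best_label is None if no label was found.
    """
    scores = {}
    for token, logprob in top_logprobs.items():
        label = token.strip()  # treat " A" and "A" as the same label
        if label in choices:
            # Keep the best score when a label appears under several spellings.
            scores[label] = max(scores.get(label, -math.inf), logprob)
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return best, scores


# Example with made-up numbers:
example = {" B": -0.2, "A": -2.1, " the": -3.0, " A": -1.9}
best, scores = pick_choice(example)
```

A label whose token does not appear in the returned top-k cannot be scored at all, which is why a large `logprobs` value matters for this approach.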