pluiefox

10 issues authored by pluiefox

Hi, I used the released NLLB checkpoint to decode the FLORES Chinese test set, and overall the results look good. However, I found that a lot of very common Chinese characters/tokens are missing...

bug
needs triage
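A quick way to check whether specific characters are covered by the released vocabulary is to tokenize them and look for `<unk>` pieces. A minimal sketch, assuming the Hugging Face `facebook/nllb-200-distilled-600M` tokenizer mirrors the released SentencePiece model and dictionary; the probe characters below are arbitrary examples:

```python
from transformers import AutoTokenizer

# Assumption: the HF tokenizer mirrors the released NLLB SentencePiece model/dictionary.
tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# A few very common Chinese characters to probe; extend with your own list.
for ch in ["的", "了", "是", "在", "我"]:
    ids = tok(ch, add_special_tokens=False).input_ids
    pieces = tok.convert_ids_to_tokens(ids)
    # A character that maps to the unk id is missing from the vocabulary.
    print(ch, pieces, "missing:", tok.unk_token_id in ids)
```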

Hi, I downloaded the dictionary and the 600M NLLB-200-Distilled checkpoint. I failed to load the model weights from the checkpoint due to an inconsistent vocabulary size. The dictionary has 255997 tokens and the...

question
needs triage
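One way to see where the mismatch comes from is to compare the number of lines in the dictionary with the embedding rows stored in the checkpoint. A minimal sketch for a fairseq-style checkpoint; the file paths are placeholders and the exact state-dict key names are an assumption. Note also that fairseq prepends special symbols (`<s>`, `<pad>`, `</s>`, `<unk>`) that are not listed in the dictionary file, so the counts are not expected to match exactly:

```python
import torch

# Placeholder paths -- adjust to where you saved the files.
DICT_PATH = "dictionary.txt"
CKPT_PATH = "checkpoint.pt"

with open(DICT_PATH, encoding="utf-8") as f:
    dict_size = sum(1 for _ in f)

# weights_only=False is needed on newer PyTorch; drop it on torch < 1.13.
state = torch.load(CKPT_PATH, map_location="cpu", weights_only=False)
model_state = state.get("model", state)  # fairseq stores weights under "model"

# Print the shapes of the embedding / output projection tensors.
for name, tensor in model_state.items():
    if "embed_tokens" in name or "output_projection" in name:
        print(name, tuple(tensor.shape))

print("dictionary lines:", dict_size)
```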

### 🐛 Describe the bug Running mpirun to launch distributed training on 2 nodes (2x8 GPUs) gets stuck in the `colossalai.launch_from_openmpi()` function. The 16 processes can be found using the top command on...

bug
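To narrow down whether the hang is inside ColossalAI or in the underlying rendezvous, one option is to bypass `colossalai.launch_from_openmpi()` and initialize `torch.distributed` directly from the Open MPI environment variables it relies on. A minimal diagnostic sketch; the rendezvous host/port is a placeholder that must point to a reachable address on node 0:

```python
import os
import torch
import torch.distributed as dist

# Open MPI exposes rank/world size through these environment variables.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

torch.cuda.set_device(local_rank)

# Placeholder rendezvous endpoint -- use the hostname/port of node 0.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://node0:29500",
    rank=rank,
    world_size=world_size,
)

# If this all-reduce also hangs, the problem is in NCCL/network setup
# (firewall, NIC selection) rather than in ColossalAI itself.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce ok, value={t.item()}")
```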

### Describe the feature I have been using the colossalai framework for my project and I noticed that there is no way to obtain the `grad_norm` after the backward pass...

enhancement
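Until the framework exposes it, the gradient norm can be recomputed by hand after the backward pass. A minimal sketch in plain PyTorch (the same quantity `torch.nn.utils.clip_grad_norm_` returns); whether this matches ColossalAI's internal value under ZeRO-style sharding is an assumption to verify:

```python
import torch

def global_grad_norm(parameters, norm_type: float = 2.0) -> float:
    """L2 norm over all parameter gradients, as used for gradient clipping."""
    norms = [
        p.grad.detach().norm(norm_type)
        for p in parameters
        if p.grad is not None
    ]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), norm_type).item()

# Usage after the backward pass:
# grad_norm = global_grad_norm(model.parameters())
# Alternatively, clip_grad_norm_ returns the pre-clipping total norm:
# grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```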

Machine translation usually takes a dynamically sized batch composed of X tokens rather than X sentences as training input. I'm wondering why deepspeed requires specifying `train_batch_size` and `train_micro_batch_size_per_gpu`, both of which...
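For context, DeepSpeed treats its batch-size fields as fixed, sample-level quantities tied together by a documented identity, so token-level dynamic batching has to be emulated on top of it. A minimal sketch of the constraint, with purely illustrative numbers:

```python
# DeepSpeed's documented invariant:
#   train_batch_size ==
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
world_size = 16                      # e.g. 2 nodes x 8 GPUs (illustrative)
train_micro_batch_size_per_gpu = 4   # samples per GPU per forward pass
gradient_accumulation_steps = 8

train_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
)
print(train_batch_size)  # 512 samples per optimizer step

# With token-based batching, one common workaround is to keep the micro
# batch size at 1 "batch object" and pack roughly X tokens into it yourself.
```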

Hi, here is the case. 1. I pretrained a language model on an English-only corpus, using BPE tokenization with vocab_size=32000. 2. I want to continue training the model on a Japanese corpus....
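A common way to handle step 2 is to learn new BPE pieces on the Japanese corpus, add them to the tokenizer, and resize the embedding matrix so the original 32000 rows keep their pretrained values. A minimal sketch with the Hugging Face API, assuming an HF-format model; the model name and token list are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- substitute your English-only pretrained model.
name = "your-english-lm"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Placeholder list of new Japanese subword pieces (e.g. learned with
# sentencepiece on the Japanese corpus and de-duplicated against the
# existing 32000-piece vocabulary).
new_pieces = ["日本", "学習", "モデル"]
num_added = tokenizer.add_tokens(new_pieces)

# New rows are appended at the end; the original 32000 embeddings are kept.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")
```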

Directly running the SFT example `trl/examples/scripts/sft.py` shows unexpected GPU memory usage: 1. directly running the SFT example with the command provided in the script shows imbalanced memory usage (**16G\~66G for peft=false and...
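To make the imbalance concrete, per-device peak memory can be logged after a few training steps; if the whole base model is placed on a single GPU by `device_map`-style loading, it shows up exactly like this. A minimal diagnostic sketch in plain PyTorch, independent of trl:

```python
import torch

def report_gpu_memory(tag: str = "") -> None:
    """Print peak allocated/reserved memory for every visible GPU."""
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.max_memory_allocated(i) / 2**30
        reserved = torch.cuda.max_memory_reserved(i) / 2**30
        print(f"{tag} cuda:{i} peak allocated {alloc:.1f} GiB, "
              f"peak reserved {reserved:.1f} GiB")

# Call once after a few training steps, e.g. from a TrainerCallback
# or simply at the end of trl/examples/scripts/sft.py.
```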

In the README for the distributed optimizer, it is mentioned that when using bf16 training, a combination of bf16 model parameters and fp32 model grads is employed, and the distributed...
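For reference, the bf16-params / fp32-grads combination follows the usual mixed-precision master-weight pattern: the optimizer keeps an fp32 main copy of each (sharded) parameter, gradients are upcast or accumulated in fp32, and the bf16 model weights are refreshed from the fp32 copy after the step. A minimal single-GPU sketch of that pattern, not Megatron-LM's actual distributed implementation:

```python
import torch

model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16, device="cuda")

# fp32 "main" copies owned by the optimizer (sharded across data-parallel
# ranks in the real distributed optimizer).
main_params = [p.detach().clone().float() for p in model.parameters()]
opt = torch.optim.AdamW(main_params, lr=1e-4)

x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
loss = model(x).float().pow(2).mean()
loss.backward()

# Upcast bf16 grads to fp32 and step on the fp32 main params.
for main, p in zip(main_params, model.parameters()):
    main.grad = p.grad.float()
opt.step()
opt.zero_grad(set_to_none=True)
model.zero_grad(set_to_none=True)

# Copy the updated fp32 main params back into the bf16 model weights.
with torch.no_grad():
    for main, p in zip(main_params, model.parameters()):
        p.copy_(main.to(torch.bfloat16))
```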

This can be reproduced by cloning the latest Megatron-LM and enabling transformer_engine for `--transformer-impl` instead of using the local implementation. The experiments are run in an `nvcr.io/nvidia/pytorch:23.11-py3` container with 8 H800 GPUs....