pluiefox

10 issues authored by pluiefox

Hi, I used the released NLLB checkpoint to decode the FLORES Chinese test set, and overall the results look good. However, I found that a lot of very common Chinese characters/tokens are missing...

bug
needs triage
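A quick way to check whether specific characters are covered by the released vocabulary is to tokenize them and look for `<unk>` pieces. A minimal sketch, assuming the Hugging Face `facebook/nllb-200-distilled-600M` tokenizer mirrors the released SentencePiece model and dictionary; the probe characters below are arbitrary examples:

```python
from transformers import AutoTokenizer

# Assumption: the HF tokenizer mirrors the released NLLB SentencePiece model/dictionary.
tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# A few very common Chinese characters to probe; extend with your own list.
for ch in ["的", "了", "是", "在", "我"]:
    ids = tok(ch, add_special_tokens=False).input_ids
    pieces = tok.convert_ids_to_tokens(ids)
    # A character that maps to the unk id is missing from the vocabulary.
    print(ch, pieces, "missing:", tok.unk_token_id in ids)
```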

Hi, I downloaded the dictionary and the 600M NLLB-200-Distilled checkpoint. I failed to load the model weights from the checkpoint due to an inconsistent vocabulary size. The dictionary has 255997 tokens and the...

question
needs triage
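One way to see where the mismatch comes from is to compare the number of lines in the dictionary with the embedding rows stored in the checkpoint. A minimal sketch for a fairseq-style checkpoint; the file paths are placeholders and the exact state-dict key names are an assumption. Note also that fairseq prepends special symbols (`<s>`, `<pad>`, `</s>`, `<unk>`) that are not listed in the dictionary file, so the counts are not expected to match exactly:

```python
import torch

# Placeholder paths -- adjust to where you saved the files.
DICT_PATH = "dictionary.txt"
CKPT_PATH = "checkpoint.pt"

with open(DICT_PATH, encoding="utf-8") as f:
    dict_size = sum(1 for _ in f)

# weights_only=False is needed on newer PyTorch; drop it on torch < 1.13.
state = torch.load(CKPT_PATH, map_location="cpu", weights_only=False)
model_state = state.get("model", state)  # fairseq stores weights under "model"

# Print the shapes of the embedding / output projection tensors.
for name, tensor in model_state.items():
    if "embed_tokens" in name or "output_projection" in name:
        print(name, tuple(tensor.shape))

print("dictionary lines:", dict_size)
```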

### 🐛 Describe the bug Running mpirun to launch distributed training on 2 nodes (2x8 GPUs) gets stuck in the `colossalai.launch_from_openmpi()` function. The 16 processes can be found using the top command on...

bug
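To narrow down whether the hang is inside ColossalAI or in the underlying rendezvous, one option is to bypass `colossalai.launch_from_openmpi()` and initialize `torch.distributed` directly from the Open MPI environment variables it relies on. A minimal diagnostic sketch; the rendezvous host/port is a placeholder that must point to a reachable address on node 0:

```python
import os
import torch
import torch.distributed as dist

# Open MPI exposes rank/world size through these environment variables.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

torch.cuda.set_device(local_rank)

# Placeholder rendezvous endpoint -- use the hostname/port of node 0.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://node0:29500",
    rank=rank,
    world_size=world_size,
)

# If this all-reduce also hangs, the problem is in NCCL/network setup
# (firewall, NIC selection) rather than in ColossalAI itself.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce ok, value={t.item()}")
```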

### Describe the feature I have been using the colossalai framework for my project and I noticed that there is no way to obtain the `grad_norm` after the backward pass...

enhancement
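Until the framework exposes it, the gradient norm can be recomputed by hand after the backward pass. A minimal sketch in plain PyTorch (the same quantity `torch.nn.utils.clip_grad_norm_` returns); whether this matches ColossalAI's internal value under ZeRO-style sharding is an assumption to verify:

```python
import torch

def global_grad_norm(parameters, norm_type: float = 2.0) -> float:
    """L2 norm over all parameter gradients, as used for gradient clipping."""
    norms = [
        p.grad.detach().norm(norm_type)
        for p in parameters
        if p.grad is not None
    ]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), norm_type).item()

# Usage after the backward pass:
# grad_norm = global_grad_norm(model.parameters())
# Alternatively, clip_grad_norm_ returns the pre-clipping total norm:
# grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```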

Machine translation usually takes a dynamically sized batch composed of X tokens rather than X sentences as training input. I'm wondering why deepspeed requires specifying `train_batch_size` and `train_micro_batch_size_per_gpu`, both of which...
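For context, DeepSpeed treats its batch-size fields as fixed, sample-level quantities tied together by a documented identity, so token-level dynamic batching has to be emulated on top of it. A minimal sketch of the constraint, with purely illustrative numbers:

```python
# DeepSpeed's documented invariant:
#   train_batch_size ==
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
world_size = 16                      # e.g. 2 nodes x 8 GPUs (illustrative)
train_micro_batch_size_per_gpu = 4   # samples per GPU per forward pass
gradient_accumulation_steps = 8

train_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
)
print(train_batch_size)  # 512 samples per optimizer step

# With token-based batching, one common workaround is to keep the micro
# batch size at 1 "batch object" and pack roughly X tokens into it yourself.
```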

Hi, here is the case. 1. I pretrained a language model on an English-only corpus, using BPE tokenization with vocab_size=32000. 2. I want to continue training the model on a Japanese corpus....
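A common way to handle step 2 is to learn new BPE pieces on the Japanese corpus, add them to the tokenizer, and resize the embedding matrix so the original 32000 rows keep their pretrained values. A minimal sketch with the Hugging Face API, assuming an HF-format model; the model name and token list are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- substitute your English-only pretrained model.
name = "your-english-lm"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Placeholder list of new Japanese subword pieces (e.g. learned with
# sentencepiece on the Japanese corpus and de-duplicated against the
# existing 32000-piece vocabulary).
new_pieces = ["日本", "学習", "モデル"]
num_added = tokenizer.add_tokens(new_pieces)

# New rows are appended at the end; the original 32000 embeddings are kept.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")
```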

Directly running the SFT example `trl/examples/scripts/sft.py` shows unexpected GPU memory usage: 1. directly running the SFT example with the command provided in the script shows imbalanced memory usage (**16G\~66G for peft=false and...
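To make the imbalance concrete, per-device peak memory can be logged after a few training steps; if the whole base model is placed on a single GPU by `device_map`-style loading, it shows up exactly like this. A minimal diagnostic sketch in plain PyTorch, independent of trl:

```python
import torch

def report_gpu_memory(tag: str = "") -> None:
    """Print peak allocated/reserved memory for every visible GPU."""
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.max_memory_allocated(i) / 2**30
        reserved = torch.cuda.max_memory_reserved(i) / 2**30
        print(f"{tag} cuda:{i} peak allocated {alloc:.1f} GiB, "
              f"peak reserved {reserved:.1f} GiB")

# Call once after a few training steps, e.g. from a TrainerCallback
# or simply at the end of trl/examples/scripts/sft.py.
```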

In the README for the distributed optimizer, it is mentioned that when using bf16 training, a combination of bf16 model parameters and fp32 model grads is employed, and the distributed...
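For reference, the bf16-params / fp32-grads combination follows the usual mixed-precision master-weight pattern: the optimizer keeps an fp32 main copy of each (sharded) parameter, gradients are upcast or accumulated in fp32, and the bf16 model weights are refreshed from the fp32 copy after the step. A minimal single-GPU sketch of that pattern, not Megatron-LM's actual distributed implementation:

```python
import torch

model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16, device="cuda")

# fp32 "main" copies owned by the optimizer (sharded across data-parallel
# ranks in the real distributed optimizer).
main_params = [p.detach().clone().float() for p in model.parameters()]
opt = torch.optim.AdamW(main_params, lr=1e-4)

x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
loss = model(x).float().pow(2).mean()
loss.backward()

# Upcast bf16 grads to fp32 and step on the fp32 main params.
for main, p in zip(main_params, model.parameters()):
    main.grad = p.grad.float()
opt.step()
opt.zero_grad(set_to_none=True)
model.zero_grad(set_to_none=True)

# Copy the updated fp32 main params back into the bf16 model weights.
with torch.no_grad():
    for main, p in zip(main_params, model.parameters()):
        p.copy_(main.to(torch.bfloat16))
```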

This can be reproduced by cloning the latest Megatron-LM and enabling transformer_engine for `--transformer-impl` instead of using the local implementation. The experiments are run in an `nvcr.io/nvidia/pytorch:23.11-py3` container with 8 H800 GPUs....