When seq_parallel_world_size is set to a value greater than 1, should use_varlen_attn not be set to True?

Open · Fovercon opened this issue 1 year ago · 2 comments

I'm working on 32k long-text SFT for Qwen2 72B. When I set seq_parallel_world_size to a value greater than 1 and use_varlen_attn to True, an error occurs. On inspection it is an assertion error stating that the length of my input_ids sequence must be divisible by seq_parallel_world_size. Once I padded the sequence to an appropriate length, that error was resolved. However, after several training iterations, the loss becomes NaN.

[screenshot: training log showing the loss turning NaN]
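For reference, a minimal sketch of the padding step I used (the helper name and pad_token_id are illustrative, not the actual xtuner API; labels are padded with -100 so the extra positions are masked out of the loss, following the usual Hugging Face convention):

```python
def pad_to_multiple(input_ids, labels, seq_parallel_world_size, pad_token_id=0):
    # Pad a packed sample so its length is divisible by the SP world size.
    remainder = len(input_ids) % seq_parallel_world_size
    if remainder:
        pad_len = seq_parallel_world_size - remainder
        input_ids = input_ids + [pad_token_id] * pad_len
        labels = labels + [-100] * pad_len  # -100 is ignored by the loss
    return input_ids, labels

# Example: with SP size 4, a 10-token pack is padded to 12 tokens.
ids, lbls = pad_to_multiple(list(range(10)), [1] * 10, 4)
assert len(ids) % 4 == 0 and lbls[-2:] == [-100, -100]
```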

Here is my specific config:

```python
use_varlen_attn = True
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 32768
pack_to_max_length = True

# parallel
sequence_parallel_size = 4

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 32
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 2
optim_type = AdamW
lr = 2e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.1
```

Fovercon · Sep 27 '24 07:09

I've seen this bug too. SP requires the input token length to be evenly divisible, but in practice the alignment doesn't actually happen, and in the end the training labels get corrupted. In xtuner/xtuner/dataset/utils.py you can add a drop_last parameter so that the trailing content is simply dropped.
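A minimal sketch of that drop_last idea (the function name and signature are illustrative, not the actual code in xtuner/xtuner/dataset/utils.py): trim the packed sequence down to the nearest multiple of the sequence-parallel size instead of padding it, so input_ids and labels stay aligned.

```python
def drop_last_to_multiple(input_ids, labels, seq_parallel_world_size):
    # Keep the largest prefix whose length is divisible by the SP world size;
    # the leftover tail tokens are dropped rather than padded.
    usable = (len(input_ids) // seq_parallel_world_size) * seq_parallel_world_size
    return input_ids[:usable], labels[:usable]

# Example: with SP size 4, a 10-token pack is trimmed to 8 tokens.
ids, lbls = drop_last_to_multiple(list(range(10)), list(range(10)), 4)
assert len(ids) == 8 and len(lbls) == 8
```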

FlyCarrot · Nov 01 '24 09:11

I'm running into this problem too.

lljzhgxd · Dec 06 '24 05:12