
[Bug] Error when fine-tuning with the MoE config

Open wang-benqiang opened this issue 1 year ago • 1 comment

Describe the bug

Thank you very much for your work! I ran into a problem when using the code for SFT. Training runs fine with a non-MoE config, but it errors out as soon as I switch to the MoE config file. Command used:

torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_MoE4_sft.py --launcher "torch"

Error message:

Traceback (most recent call last):
  File "train.py", line 324, in <module>
    main(args)
  File "train.py", line 105, in main
    model = initialize_model()
  File "/root/wbq/internlm_moe/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/root/wbq/internlm_moe/InternEvo/internlm/train/pipeline.py", line 167, in initialize_model
    model = MODEL_INITIALIZER.get_module(module_name=gpc.config.model_type)(**(gpc.config.model))
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 584, in build_model_with_moe_cfg
    return _build_generic_model_1d(num_layers=num_layers, num_chunks=num_chunks, **cfg)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 482, in _build_generic_model_1d
    chunk = PackedFlashInternLm1D(**filter_kwargs(PackedFlashInternLm1D.__init__, kwargs)).to(device)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 356, in __init__
    [
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 357, in <listcomp>
    PackedFlashBaseLayer1D(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 94, in __init__
    self.mixer = MHA(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/multi_head_attention.py", line 364, in __init__
    self.rotary_emb = RotaryEmbedding(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/embedding.py", line 287, in __init__
    self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
TypeError: arange() received an invalid combination of arguments - got (int, int, int, dtype=torch.dtype, device=device), but expected one of:
 * (Number end, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, *, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, Number step, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
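
If it helps to isolate the problem: the failing call from embedding.py line 287 can be tried on its own. A minimal sketch (dim and base below are placeholder values; the real ones come from the model config) is:

import torch

# Placeholder values; in InternEvo these come from the model config.
dim, base = 128, 10000
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Same argument combination as embedding.py line 287.
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
print(inv_freq.shape, inv_freq.device)

On a stock torch 2.1.0 install this argument combination is accepted, so the failure presumably depends on what device object reaches RotaryEmbedding in the MoE path.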

Environment

torch==2.1.0+cu118
transformers<4.30.0
sentencepiece
numpy
tqdm
psutil
packaging
pre-commit
ninja
gputil
pytest
boto3
botocore
torch-scatter
pyecharts
py-libnuma
pynvml
tensorboard

Other information

1. I only changed the training and validation dataset paths in ./configs/7B_MoE4_sft.py (a sketch of the kind of edit is shown below).
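
For completeness, the change was of this kind (a hypothetical excerpt with placeholder paths; the actual field names in 7B_MoE4_sft.py may differ):

# Hypothetical excerpt of ./configs/7B_MoE4_sft.py; only the dataset paths were edited.
TRAIN_FOLDER = "/path/to/train/data"   # previously the default placeholder
VALID_FOLDER = "/path/to/valid/data"   # previously the default placeholder

data = dict(
    train_folder=TRAIN_FOLDER,
    valid_folder=VALID_FOLDER,
    # all other fields left at their defaults
)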

wang-benqiang · Mar 28 '24 14:03