Chinese-version notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb

Qwen docs: https://qwen.readthedocs.io/en/latest/training/ms_swift.html

English Version

We are thrilled to hear about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large model training framework supports CPT/SFT/DPO/GRPO for Qwen3/Qwen3-MoE from day one. It also supports Megatron training (CPT/SFT) for Qwen3/Qwen3-MoE, which on MoE models is about 10 times faster than training with transformers.

We will showcase a runnable fine-tuning demo and provide the format for custom datasets.

Before starting the fine-tuning process, please ensure that your environment is properly set up.

# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

pip install liger-kernel transformers -U

Qwen3-8B SFT

The following script trains Qwen3-8B. It can be run on the free A10 computing resources provided by ModelScope: https://modelscope.cn/my/mynotebook

# Training GPU memory: 22GB
# You can specify `--dataset AI-ModelScope/alpaca-gpt4-data-zh` to run the experiment
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --packing true \
    --use_liger_kernel true \
    --attn_impl flash_attn
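
As a rough illustration of what `--packing true` is for: packing concatenates several short samples into one sequence of at most `--max_length` tokens, so fewer padding tokens are processed. The sketch below uses a simple first-fit-decreasing heuristic on made-up sample lengths. It is not ms-swift's actual packing implementation, and `pack_lengths` is a name invented for this example.

```python
def pack_lengths(lengths, max_length=2048):
    """Greedy first-fit-decreasing packing of sample lengths into bins."""
    bins = []  # each bin is a list of sample lengths whose sum is <= max_length
    for length in sorted(lengths, reverse=True):
        if length > max_length:
            raise ValueError(f"sample of length {length} exceeds max_length")
        for current in bins:
            if sum(current) + length <= max_length:
                current.append(length)
                break
        else:
            # No existing bin has room, so start a new packed sequence.
            bins.append([length])
    return bins


sample_lengths = [1500, 900, 600, 500, 400, 100]  # hypothetical token counts
packed = pack_lengths(sample_lengths)
print(packed)
```

With these lengths, six samples fit into two packed sequences of 2000 tokens each.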

The format for a custom dataset is as follows (the system field is optional). Simply specify --dataset <dataset_path>:

For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
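
A minimal sketch of producing such a file from Python, assuming a hypothetical output file name `train.jsonl`. The checks in `check_sample` are assumptions made for illustration (one optional leading system message, final turn from the assistant); they are not ms-swift's own validation.

```python
import json
import os
import tempfile

ALLOWED_ROLES = {"system", "user", "assistant"}

samples = [
    {
        "messages": [
            # The system message is optional.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Where is the capital of Zhejiang?"},
            {
                "role": "assistant",
                "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou.",
            },
        ]
    },
]


def check_sample(sample):
    """Illustrative sanity checks for one SFT sample."""
    messages = sample["messages"]
    roles = [m["role"] for m in messages]
    assert set(roles) <= ALLOWED_ROLES, roles
    assert "system" not in roles[1:], "a system message must come first"
    assert roles[-1] == "assistant", "SFT samples should end with an assistant turn"
    assert all(isinstance(m["content"], str) for m in messages)
    return True


# Write one JSON object per line (JSONL), then read it back.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(all(check_sample(s) for s in loaded))
```

The resulting path can then be passed as `--dataset <dataset_path>`.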

Datasets that contain no thinking (reasoning) process can be handled in two ways, to limit the damage that fine-tuning does to the model's thinking ability:

Option 1: During training, additionally specify --loss_scale ignore_empty_think so that the empty <think>\n\n</think>\n\n block is excluded from the loss calculation, which prevents the loss of thinking ability.

Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
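
Conceptually, excluding the empty think block from the loss means that those token positions do not contribute to the training objective. The sketch below shows this idea with label masking on a toy token list, using -100, the default `ignore_index` of PyTorch's cross-entropy loss. It is only an illustration: the tokenization and the function `mask_empty_think` are assumptions, and this is not how ms-swift implements `--loss_scale ignore_empty_think`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
EMPTY_THINK = ["<think>", "\n\n", "</think>", "\n\n"]  # toy tokenization (assumption)


def mask_empty_think(tokens, labels):
    """Return a copy of labels with every empty-think span set to IGNORE_INDEX."""
    labels = list(labels)
    n = len(EMPTY_THINK)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == EMPTY_THINK:
            labels[start:start + n] = [IGNORE_INDEX] * n
    return labels


tokens = ["<think>", "\n\n", "</think>", "\n\n", "The", " capital", " is", " Hangzhou", "."]
labels = list(range(len(tokens)))  # stand-in label ids
masked = mask_empty_think(tokens, labels)
print(masked)
```

Only the answer tokens keep their labels, so only they are trained on.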

Option 2: Add /no_think to the query in the dataset to avoid the loss of thinking ability.

Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo2.sh

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
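
A minimal sketch of converting an existing dataset without thinking into the Option 2 format. It appends ` /no_think` to user turns and prefixes assistant turns with the empty think block. Applying this to every turn of a multi-turn sample is an assumption made here, and `to_no_think` is a hypothetical helper, not part of ms-swift.

```python
EMPTY_THINK_PREFIX = "<think>\n\n</think>\n\n"


def to_no_think(sample):
    """Return a copy of the sample rewritten into the /no_think format."""
    messages = []
    for message in sample["messages"]:
        message = dict(message)  # do not modify the input in place
        if message["role"] == "user" and not message["content"].endswith("/no_think"):
            message["content"] += " /no_think"
        elif message["role"] == "assistant" and not message["content"].startswith("<think>"):
            message["content"] = EMPTY_THINK_PREFIX + message["content"]
        messages.append(message)
    return {"messages": messages}


raw = {
    "messages": [
        {"role": "user", "content": "Where is the capital of Zhejiang?"},
        {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."},
    ]
}
converted = to_no_think(raw)
print(converted)
```

The function is idempotent, so running it twice on the same sample does not add a second suffix or prefix.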

10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)

ref: https://github.com/modelscope/ms-swift/blob/51cafe59325603b2bf0f63cf688c659fbe9abc5d/swift/llm/dataset/dataset/llm.py#L835

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --model_author swift \
    --model_name swift-robot
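
A back-of-the-envelope estimate of how many optimizer steps this demo takes. It assumes that the `#2000` and `#600` suffixes select 2000 and 600 samples, that a single GPU is used, and that no samples are held out for validation, so the real step count may differ slightly.

```python
num_samples = 2000 + 600  # assumed meaning of the '#2000' and '#600' suffixes
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1  # CUDA_VISIBLE_DEVICES=0

# Number of samples consumed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Approximate optimizer steps for one epoch.
approx_steps = num_samples / effective_batch_size
print(effective_batch_size, approx_steps)
```

Under these assumptions one epoch is roughly 160 optimizer steps, so `--save_steps 50` writes about three checkpoints, of which `--save_total_limit 2` keeps two.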

Run inference to test the fine-tuned model:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048

Qwen3-8B GRPO

Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html

The AI-MO/NuminaMath-TIR dataset is used, and the accuracy function computes the accuracy reward for the model's responses. The following package must be installed to compute the rewards:

pip install math_verify==0.5.2

The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a solution column is required to compute the accuracy.

{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}

You can also train with custom reward functions or reward models. Columns in the dataset will be passed into **kwargs of the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py

    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_math_acc external_math_format \
    --reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2
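
To illustrate how dataset columns reach a reward function through keyword arguments, here is a standalone sketch of an accuracy-style reward. The signature (a list of `completions` plus a `solution` column) is an assumption for illustration. The function does not use ms-swift's plugin classes, and it is much cruder than the built-in `accuracy` reward, which relies on math_verify.

```python
import re


def simple_accuracy_reward(completions, solution, **kwargs):
    """Return 1.0 when the last number in a completion equals the reference, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, solution):
        # Ignore the thinking part, if present, and look only at the final answer.
        answer = completion.split("</think>")[-1]
        numbers = re.findall(r"-?\d+(?:\.\d+)?", answer)
        correct = bool(numbers) and float(numbers[-1]) == float(reference)
        rewards.append(1.0 if correct else 0.0)
    return rewards


rewards = simple_accuracy_reward(
    completions=["<think>\n1 + 1 = 2\n</think>\n\nIt equals 2", "I think it is 3."],
    solution=["2", "2"],  # numeric references only in this sketch
)
print(rewards)
```

Any other dataset column would arrive in `**kwargs` in the same way as `solution`.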

During training, we use vLLM to accelerate sampling. With num_infer_workers=8, one vLLM engine is deployed on each device.

The training script is as follows:

# 70G*8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset AI-MO/NuminaMath-TIR \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --gradient_accumulation_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 4096 \
    --vllm_max_model_len 8192 \
    --reward_funcs accuracy \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.4 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3 \
    --num_infer_workers 8 \
    --tensor_parallel_size 1 \
    --temperature 1.0 \
    --top_p 0.85 \
    --report_to wandb \
    --log_completions true \
    --overlong_filter true
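
For readers new to GRPO, the sketch below shows the group-relative advantage that gives the method its name: the rewards of all completions sampled for one prompt are normalized by the group's mean and standard deviation. With `--num_generations 16` each group would hold 16 rewards; four are used here for brevity. The population standard deviation and the small `eps` are choices made for this example, not a statement about ms-swift's implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


group_rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. accuracy rewards for 4 completions
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage, and a group in which every completion receives the same reward yields all-zero advantages and therefore no learning signal.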

Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)

ms-swift integrates Megatron's parallelism techniques to accelerate large model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the DeepSeek-R1 distilled series.

For environment preparation (image) and the conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation; it is not covered here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html

We launch the training command with DLC. The training environment is 2 machines, each with 8 * 80GiB A800 GPUs:

More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Please ensure that the weight saving paths are the same for both nodes.
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
    --load Qwen3-30B-A3B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 100 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-30B-A3B-Base \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --use_flash_attn true
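
A small worked calculation of how the batch-size flags above relate to the parallel layout. It assumes pipeline and context parallel sizes of 1 (neither is set in the script) and uses the usual Megatron relationship between world size, data-parallel size, and micro-batches. Expert parallelism is not modelled here.

```python
world_size = 2 * 8  # 2 machines x 8 GPUs
tensor_model_parallel_size = 2
pipeline_model_parallel_size = 1  # assumption: not set in the script
context_parallel_size = 1  # assumption: not set in the script
micro_batch_size = 1
global_batch_size = 16

# GPUs that jointly hold one model replica.
model_parallel_size = (
    tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size
)
assert world_size % model_parallel_size == 0

# Number of model replicas that process different data.
data_parallel_size = world_size // model_parallel_size

# Micro-batches accumulated per replica for one optimizer step.
samples_per_pass = micro_batch_size * data_parallel_size
assert global_batch_size % samples_per_pass == 0
num_micro_batches = global_batch_size // samples_per_pass

print(data_parallel_size, num_micro_batches)
```

Under these assumptions there are 8 data-parallel replicas, and each accumulates 2 micro-batches per optimizer step.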

Training loss (partial):

The custom dataset format is the same as for swift sft (described above). Specify it with --dataset <dataset_path>.

Below is the comparison of full-parameter training speed/GPU memory usage for the Qwen3-30B-A3B model using megatron sft and swift sft:

Metric	Megatron-LM	DeepSpeed-ZeRO2	DeepSpeed-ZeRO3
Training Speed	9.6s/it	-	91.2s/it
GPU Memory Usage	16 * 60GiB	OOM	16 * 80GiB
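
The ratios implied by the table can be checked with a few lines of arithmetic, using only the numbers reported above:

```python
megatron_s_per_it = 9.6
zero3_s_per_it = 91.2

# How many times faster one Megatron-LM iteration is than one ZeRO3 iteration.
speedup = zero3_s_per_it / megatron_s_per_it

# Total GPU memory used by Megatron-LM relative to ZeRO3.
memory_ratio = (16 * 60) / (16 * 80)

print(f"speedup: {speedup:.1f}x, memory: {memory_ratio:.0%}")
```

This gives a speedup of about 9.5x at 75% of the GPU memory, which is where the "about 10 times faster" figure comes from.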

Apr 28 '25 16:04 Jintao-Huang

Model Inference:

Thinking Mode:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192

<<<  who are you?
<think>
Okay, the user is asking "who are you?" Let me start by introducing myself as Qwen, the large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and engaging in conversations. But I need to keep it concise. Also, the user might want to know how I can assist them. Maybe I should ask how I can help them today. Let me check if there's anything else important to include. Oh, I should make sure the tone is friendly and approachable. Alright, that should cover it.
</think>

Hello! I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, such as answering questions, creating content, writing stories, coding, and more. How can I help you today? 😊

<<< who are you? /no_think
<think>

</think>

I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I help you today?

Non-Thinking Mode:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192 \
    --response_prefix '<think>\n\n</think>\n\n'

<<< who are you?
<think>

</think>

I am Qwen, a large-scale language model developed by Alibaba Cloud. I am designed to assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I assist you today?
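
When post-processing such outputs, it is often useful to separate the thinking part from the final answer. The sketch below does this with a regular expression. `split_thinking` is a hypothetical helper written for this example, and it assumes at most one <think>...</think> block at the start of the response, as in the outputs above.

```python
import re

THINK_PATTERN = re.compile(r"<think>\n?(.*?)\n?</think>\s*", re.DOTALL)


def split_thinking(response):
    """Return (thinking, answer); thinking is empty in non-thinking mode."""
    match = THINK_PATTERN.search(response)
    if match is None:
        return "", response.strip()
    thinking = match.group(1).strip()
    answer = response[match.end():].strip()
    return thinking, answer


thinking_output = "<think>\nOkay, the user is asking who I am.\n</think>\n\nHello! I am Qwen."
non_thinking_output = "<think>\n\n</think>\n\nI am Qwen."

print(split_thinking(thinking_output))
print(split_thinking(non_thinking_output))
```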

Model Quantization:

Qwen3-32B-AWQ: https://modelscope.cn/models/swift/Qwen3-32B-AWQ

Qwen3-30B-A3B-AWQ: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ

Qwen3-235B-A22B-AWQ: https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ

Apr 28 '25 19:04 Jintao-Huang

Which vLLM version should be used?

Apr 29 '25 03:04 EvilCalf

vllm==0.8.5

Apr 29 '25 03:04 Jintao-Huang

Converting the HF-format weights to Megatron format fails:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen/Qwen3-30B-A3B-mcore

Error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/export.py", line 5, in <module>
[rank0]:     export_main()
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/export/export.py", line 50, in export_main
[rank0]:     return SwiftExport(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/base.py", line 47, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/export/export.py", line 34, in run
[rank0]:     convert_hf2mcore(args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/convert.py", line 72, in convert_hf2mcore
[rank0]:     assert megatron_model_meta is not None, f'Model: {args.model} is not supported.'
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: Model: Qwen/Qwen3-30B-A3B is not supported.

Apr 29 '25 03:04 sosofun

This support is currently only on the main branch; ms-swift==3.4.0 will be released tonight.

Apr 29 '25 04:04 Jintao-Huang

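Request: please add a notebook for self-cognition training of Qwen3-8B.

When I used "self-cognition-sft.ipynb" to train Qwen3-8B in the PAI-DSW environment provided by ModelScope, I found that this notebook cannot train Qwen3 models.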

Apr 29 '25 04:04 NianBroken

Could you add a script for full-parameter fine-tuning?

Apr 29 '25 06:04 yxk9810

You can refer to the example here and modify the --model parameter accordingly.

https://github.com/modelscope/ms-swift/blob/main/examples/train/full/qwen2_5_32b.sh

Apr 29 '25 06:04 Jintao-Huang

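Request: please add a notebook for self-cognition training of Qwen3-8B.

When I used "self-cognition-sft.ipynb" to train Qwen3-8B in the PAI-DSW environment provided by ModelScope, I found that this notebook cannot train Qwen3 models.

A self-cognition fine-tuning demo has been added.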

Apr 29 '25 08:04 Jintao-Huang

If I currently have data without a reasoning process, but I want to use this data to fine-tune Qwen3, should I simply add /no_think after the prompt and prefix the response with <think>\n\n</think>\n\n?

Apr 29 '25 09:04 qingzhong1

Perhaps you can refer to this for a solution:

https://github.com/modelscope/ms-swift/blob/51cafe59325603b2bf0f63cf688c659fbe9abc5d/swift/llm/dataset/dataset/llm.py#L835

Apr 29 '25 09:04 Jintao-Huang

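A self-cognition fine-tuning demo has been added.

How can I export a successfully fine-tuned model to GGUF format? Please add a notebook for converting models fine-tuned with ms-swift into GGUF files.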

Apr 29 '25 11:04 NianBroken

Perhaps you can refer to this for a solution:

ms-swift/swift/llm/dataset/dataset/llm.py

Line 835 in 51cafe5

row['query'] = row['query'] + ' /no_think'

@Jintao-Huang If reasoning is not used, can the model still be fine-tuned with the Qwen2.5 template?

Apr 29 '25 12:04 stephen-nju

When using --packing true, please additionally use --attn_impl flash_attn. This was missed in the best practices.

Apr 29 '25 23:04 Jintao-Huang

Running swift deploy on a Huawei NPU fails:

[INFO:swift] model_kwargs: {'device_map': 'npu:0'}
Loading checkpoint shards:   0%|                                                                                                  | 0/5 [00:00<?, ?it/s][2025-05-01 05:26:45,878] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[INFO:swift] Successfully registered `/data4/code185/ms-swift/swift/llm/dataset/data/dataset_info.json`.
Loading checkpoint shards:   0%|                                                                                                  | 0/5 [01:28<?, ?it/s]
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data4/code185/ms-swift/swift/llm/infer/deploy.py", line 207, in deploy_main
    SwiftDeploy(args).main()
  File "/data4/code185/ms-swift/swift/llm/infer/deploy.py", line 39, in __init__
    super().__init__(args)
  File "/data4/code185/ms-swift/swift/llm/infer/infer.py", line 32, in __init__
    model, self.template = prepare_model_template(args)
  File "/data4/code185/ms-swift/swift/llm/infer/utils.py", line 144, in prepare_model_template
    model, processor = args.get_model_processor(**kwargs)
  File "/data4/code185/ms-swift/swift/llm/argument/base_args/base_args.py", line 274, in get_model_processor
    return get_model_tokenizer(**kwargs)
  File "/data4/code185/ms-swift/swift/llm/model/register.py", line 571, in get_model_tokenizer
    model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
  File "/data4/code185/ms-swift/swift/llm/model/register.py", line 272, in get_model_tokenizer_with_flash_attn
    return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
  File "/data4/code185/ms-swift/swift/llm/model/register.py", line 241, in get_model_tokenizer_from_local
    model = automodel_class.from_pretrained(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
    return model_class.from_pretrained(
  File "/data4/code185/ms-swift/swift/llm/model/patcher.py", line 282, in _new_from_pretrained
    return from_pretrained(cls, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4399, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4833, in _load_pretrained_model
    disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 787, in _load_state_dict_into_meta_model
    param = param[...]
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

This seems to be caused by /data4/code185/ms-swift/swift/llm/model/patcher.py. Is there any way to fix it? Thanks.

Apr 30 '25 22:04 Gpwner

That is probably not the cause. Please check whether the example code here runs on the NPU:

https://modelscope.cn/models/Qwen/Qwen3-8B

May 01 '25 02:05 Jintao-Huang

Request: please add a notebook for converting models fine-tuned with ms-swift into GGUF files.

May 01 '25 08:05 NianBroken

sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir

sft.py reports an error: --model cannot be used directly.

May 02 '25 01:05 zhenhua

sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir

sft.py reports an error: --model cannot be used directly.

Please upgrade to ms-swift>=3.4.0.
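
A minimal sketch for checking whether a version string meets the 3.4.0 requirement. `parse_version` and `is_at_least` are hypothetical helpers that only compare the leading numeric components, so pre-release ordering is not handled.

```python
import re


def parse_version(version):
    """Turn '3.4.0' into (3, 4, 0); stop at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, required="3.4.0"):
    """Compare two version strings by their numeric components."""
    return parse_version(installed) >= parse_version(required)


print(is_at_least("3.3.1"), is_at_least("3.4.0"), is_at_least("3.10.0"))
```

The installed version string can be obtained with `importlib.metadata.version("ms-swift")` from the standard library.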

May 02 '25 01:05 Jintao-Huang

Qwen3-30B-A3B trains successfully, but megatron sft on Qwen3-32B fails with:

2025-05-02T03:37:00.069008389Z [rank24]: raise RuntimeError(
2025-05-02T03:37:00.069009658Z [rank24]: torch._dynamo.exc.TorchRuntimeError: Failed running call_function (*(FakeTensor(..., device='cuda:0', size=(90880, 37984)), (FakeTensor(..., device='cuda:0', size=(90880,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(90752,), dtype=torch.int64))), **{}):
2025-05-02T03:37:00.069011338Z [rank24]: Attempting to broadcast a dimension of length 90752 at -1! Mismatching argument at index 1 had torch.Size([90752]); but expected shape should be broadcastable to [90880]
2025-05-02T03:37:00.069012818Z
2025-05-02T03:37:00.069013908Z [rank24]: from user code:
2025-05-02T03:37:00.069015038Z [rank24]: File "xxxx/Megatron-LM-0.11.0/megatron/core/fusions/fused_cross_entropy.py", line 37, in calculate_predicted_logits
2025-05-02T03:37:00.069016538Z [rank24]: VocabParallelCrossEntropy.calculate_predicted_logits(
2025-05-02T03:37:00.069017918Z [rank24]: File "xxx/Megatron-LM-0.11.0/megatron/core/tensor_parallel/cross_entropy.py", line 59, in calculate_predicted_logits
2025-05-02T03:37:00.069019567Z [rank24]: predicted_logits_1d = logits_2d[arange_1d, masked_target_1d]
2025-05-02T03:37:00.069020817Z
2025-05-02T03:37:00.069022057Z [rank24]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
2025-05-02T03:37:00.069023517Z
2025-05-02T03:37:00.069024607Z
2025-05-02T03:37:00.069025767Z [rank24]: You can suppress this exception and fall back to eager by setting:
2025-05-02T03:37:00.069027127Z [rank24]: import torch._dynamo
2025-05-02T03:37:00.069028297Z [rank24]: torch._dynamo.config.suppress_errors = True

May 02 '25 03:05 llp1992

Could you check where this is raised? Please provide a more complete error message, ideally as a screenshot.

May 02 '25 06:05 Jintao-Huang

sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir

sft.py reports an error: --model cannot be used directly.

Please upgrade to ms-swift>=3.4.0.

Yes, upgrading solved it.

May 02 '25 07:05 zhenhua

Have the trained MoE models been evaluated on benchmarks? I'm worried about numerical issues.

May 02 '25 10:05 no-execution

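Could you check where this is raised? Please provide a more complete error message, ideally as a screenshot.

All Qwen3 dense models hit this error in Megatron training.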

May 02 '25 12:05 llp1992

Is there a swift stack trace? Everything here comes from torch.

May 02 '25 12:05 Jintao-Huang

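Have the trained MoE models been evaluated on benchmarks? I'm worried about numerical issues.

We benchmarked qwen2.5-7b previously. For qwen3-moe we have checked the weight-conversion precision, the initial training loss and grad_norm are normal, and a manual check after 500 training steps showed normal results, so problems are unlikely.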

May 02 '25 13:05 Jintao-Huang

Is there a swift stack trace? Everything here comes from torch.

Could you try running it on your side?

May 02 '25 14:05 llp1992

https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
megatron sft \
    --load Qwen3-8B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity selective \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 100 \
    --min_lr 1e-6 \
    --save megatron_output \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --use_flash_attn true