
How do I use swift infer?

Open Oukaishen opened this issue 8 months ago • 1 comments

Hello, my infer command is:

swift infer \
  --model <path to an already-trained model checkpoint> \
  --val_dataset <prepared offline evaluation dataset: labeled, query-response format, with extra fields> \
  --result_path <expected output directory> \
  --max_pixels 131712 \
  --max_new_tokens 32 \
  --logprobs true \
  --train_type full \
  --torch_dtype bfloat16

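I would like help with the following:

  1. Some URLs in val_dataset cannot be downloaded, which makes the whole program fail. Is there a way to skip those samples, as training does?
  2. val_dataset contains extra fields plus the original answer labels. How can I keep them in the output so that I can compute statistics?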
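One possible workaround for the first question, independent of any swift option and not confirmed by anyone in this thread, is to pre-filter the dataset before running inference. The sketch below assumes the dataset is a JSONL file whose media references are stored under the keys `images`, `audios`, or `videos` (an assumption; adjust `MEDIA_KEYS` to your data). The helper names `is_loadable`, `split_rows`, and `filter_jsonl` are hypothetical. Rows are copied unchanged, so extra fields and labels stay in the cleaned file.

```python
import http.client
import json
import os
import urllib.request

# Assumed field names that hold media references; adjust to your dataset.
MEDIA_KEYS = ("images", "audios", "videos")


def is_loadable(ref, timeout=5.0):
    """Return True if a local path exists or a URL answers a HEAD request."""
    if ref.startswith(("http://", "https://")):
        try:
            request = urllib.request.Request(ref, method="HEAD")
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return 200 <= response.status < 400
        except (OSError, ValueError, http.client.HTTPException):
            # Covers URLError/HTTPError, timeouts, and malformed URLs.
            return False
    return os.path.isfile(ref)


def split_rows(rows, check=is_loadable):
    """Split rows into (kept, skipped) according to whether all media load."""
    kept, skipped = [], []
    for index, row in enumerate(rows):
        refs = []
        for key in MEDIA_KEYS:
            value = row.get(key) or []
            if isinstance(value, str):
                value = [value]
            # Only string references (paths or URLs) are checked here.
            refs.extend(r for r in value if isinstance(r, str))
        bad = [r for r in refs if not check(r)]
        if bad:
            skipped.append({"index": index, "bad_refs": bad})
        else:
            kept.append(row)
    return kept, skipped


def filter_jsonl(src, dst, skipped_path, check=is_loadable):
    """Write loadable rows to dst and a report of skipped rows to skipped_path."""
    with open(src, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    kept, skipped = split_rows(rows, check)
    with open(dst, "w", encoding="utf-8") as f:
        for row in kept:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    with open(skipped_path, "w", encoding="utf-8") as f:
        for item in skipped:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(kept), len(skipped)
```

The cleaned file can then be passed to `--val_dataset`. A HEAD request only shows that the server responds, not that the payload decodes, so a stricter `check` that downloads and opens the file may be needed. Whether the inference output preserves input order or extra fields is not established here, so matching results back to the cleaned rows should be verified on a small sample first.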

Oukaishen avatar Apr 18 '25 17:04 Oukaishen

I switched to --infer_backend vllm, and inference does look considerably faster, but it fails at the code below:

Image

[rank1]:   File "/usr/local/lib64/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2723, in all_gather_object
[rank1]:     all_gather(object_size_list, local_size, group=group)
[rank1]:   File "/usr/local/lib64/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib64/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3341, in all_gather
[rank1]:     work = group.allgather([tensor_list], [tensor])
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.

Oukaishen avatar Apr 23 '25 06:04 Oukaishen

This problem should no longer occur on the main branch.

Is the dataset very large here?

Jintao-Huang avatar Jun 04 '25 09:06 Jintao-Huang

This issue has been inactive for over 3 months and will be automatically closed in 7 days. If this issue is still relevant, please reply to this message.

github-actions[bot] avatar Sep 21 '25 00:09 github-actions[bot]

This issue has been automatically closed due to inactivity. If needed, it can be reopened.

github-actions[bot] avatar Oct 11 '25 00:10 github-actions[bot]

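Hello, did you find a way to skip the failing samples? I have run into a similar problem: some audio files fail to load, which interrupts inference, and it is hard to check manually which audio files are bad.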
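A minimal sketch for locating the bad audio files ahead of time, offered as a workaround rather than a swift feature: try to open every file once and collect the failures. The helper names `probe_wav` and `find_bad_audio` are hypothetical, and the default probe uses the standard-library `wave` module, so it only understands WAV. For other formats, pass a `probe` built on whatever decoder your pipeline actually uses.

```python
import wave


def probe_wav(path):
    """Open a WAV file and read one frame; raises if the file is unreadable."""
    with wave.open(path, "rb") as wav:
        wav.readframes(1)


def find_bad_audio(paths, probe=probe_wav):
    """Return (path, error message) pairs for files the probe cannot read."""
    bad = []
    for path in paths:
        try:
            probe(path)
        except Exception as exc:  # broad on purpose: any failure marks the file
            bad.append((path, f"{type(exc).__name__}: {exc}"))
    return bad
```

Running `find_bad_audio` over all audio paths in the dataset gives a list that can be used to drop or repair the offending samples before inference. A file that passes a lightweight probe can still fail in a different decoder, so the probe should mirror the real loading step as closely as possible.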

MM-WW55 avatar Dec 06 '25 09:12 MM-WW55