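ms-swift: fine-tuning experiments with InternVL-1.5 (AttributeError: 'NoneType' object has no attribute 'shape')

@hjh0119 After fine-tuning the 1.5 version, I ran the inference command from the tutorial, loading my local weights. The infer command I used was: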

CUDA_VISIBLE_DEVICES=0,1 swift infer  --ckpt_dir output/internvl-chat-v1_5/v0-20240512-191616/checkpoint-25 --load_dataset_config true  --dtype bf16  --model_id_or_path xxxxx/InternVL/pretrained/InternVL-Chat-V1-5

After loading, however, it raised the error below. Did this come up in your earlier testing?

Traceback (most recent call last):
  File "/data2/renyw/PythonWorkspace/FM-LLM/swift/swift/cli/infer.py", line 5, in <module>
    infer_main()
  File "/data2/renyw/PythonWorkspace/FM-LLM/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/data2/renyw/PythonWorkspace/FM-LLM/swift/swift/llm/infer.py", line 376, in llm_infer
    if args.show_dataset_sample >= 0 and val_dataset.shape[0] > args.show_dataset_sample:
AttributeError: 'NoneType' object has no attribute 'shape'
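
The traceback shows that `val_dataset` is `None` by the time `.shape[0]` is read. Because `args.show_dataset_sample >= 0` is evaluated first, that part of the condition must have passed. A minimal sketch of a defensive check follows. The helper name `should_subsample` is hypothetical, it uses `len()` instead of `.shape[0]` for simplicity, and it is not necessarily how the library fixed the problem:

```python
def should_subsample(val_dataset, show_dataset_sample: int) -> bool:
    """Return True only when a validation set exists and is larger than the sample size.

    Hypothetical helper for illustration; not part of swift.
    """
    # Guard first: no validation set was loaded, so there is nothing to subsample.
    if val_dataset is None:
        return False
    return show_dataset_sample >= 0 and len(val_dataset) > show_dataset_sample
```

With a guard like this, a missing validation set skips the sampling branch instead of raising `AttributeError`.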

May 13 '24 13:05 MVP-D77

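@hjh0119 One more question. One of the most important changes in InternVL 1.5 relative to 1.2 is resolution handling: it can split a large image into several smaller tiles and feed them in as a batch. During fine-tuning, is each image resized directly to a fixed size, or is it dynamically split into multiple tiles? If the latter, is there a way to control the number of tiles? Thanks for your answer.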

May 13 '24 13:05 MVP-D77

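First question: does it still happen with the latest code?

Second question: see https://github.com/modelscope/swift/blob/main/swift/llm/utils/vision_utils.py#L74C1-L90C24. It uses the official image-processing logic, which computes the number of ViT patches from the image size.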
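
To make the "patch count depends on image size" point concrete, here is a simplified sketch of aspect-ratio-based tiling in the spirit of InternVL's dynamic preprocessing. It only counts tiles and does no image processing. The function name `count_tiles` is hypothetical, and the grid selection rule is my reading of the official logic, so treat it as an approximation rather than the library's code. The defaults `image_size=448` and `max_num=6`, and the use of a thumbnail, mirror the `load_image` call in this thread.

```python
def count_tiles(width, height, image_size=448, min_num=1, max_num=6, use_thumbnail=True):
    """Pick a (cols, rows) grid for an image and return it with the total tile count.

    Hypothetical illustration: chooses the grid whose aspect ratio is closest
    to the image's, among grids with min_num <= cols * rows <= max_num.
    """
    aspect_ratio = width / height
    area = width * height
    # All candidate grids, smallest tile count first.
    grids = [(c, r) for c in range(1, max_num + 1) for r in range(1, max_num + 1)
             if min_num <= c * r <= max_num]
    grids.sort(key=lambda g: g[0] * g[1])

    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid only if the image is big enough to fill it.
            best = (cols, rows)

    tiles = best[0] * best[1]
    # A global thumbnail is appended only when the image was actually split.
    if use_thumbnail and tiles != 1:
        tiles += 1
    return best, tiles
```

Under these assumptions an 896x896 image yields a 2x2 grid plus a thumbnail (5 tiles), while passing `max_num=1` always yields a single tile, which is the behavior a task that does not need splitting would want.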

May 14 '24 11:05 hjh0119

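@hjh0119 With the latest code, running inference directly on the fine-tuned model still raises the same error. Have you tested inference after fine-tuning, and is there a working example? I'd appreciate any pointers.

On the second question, about computing the number of ViT patches from the image size: can the maximum, max_num, be controlled during fine-tuning? Is there an existing parameter for it? I want to do a visual grounding task, which may not need the image split into multiple patches. Is the current default during fine-tuning to always split large images into smaller ones when computing the ViT patch count?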

```python
def load_image(img_path, input_size=448, max_num=6):
    if isinstance(img_path, str):
        img_path = img_path.strip()
        if img_path.startswith('http'):
            content = requests.get(img_path).content
            image = Image.open(BytesIO(content))
        else:
            image = Image.open(img_path)
    else:
        image = img_path
    if image.mode != 'RGB':
        image = image.convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
```

May 15 '24 07:05 MVP-D77

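First question: I've found the bug and am fixing it.

Second question: adding a dedicated parameter for this would be somewhat bloated, so for now the implementation stays aligned with the official one.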

May 15 '24 09:05 hjh0119

fixed https://github.com/modelscope/swift/pull/937

May 15 '24 13:05 hjh0119