aabbc-cell issues

Results: 5 issues of aabbc-cell

How to run multi-image inference without lmdeploy or swift

The code I am using is the one given at [https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5], running InternVL-Chat-V1-5 on multiple V100s:

```
import torch
from transformers import AutoModel, AutoTokenizer

path = "./InternVL-Chat-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# load_image is the helper defined on the model card linked above
pixel_values = load_image('xxx.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)
#...
```

Why are the tags and caption text predicted by Tag2Text different? Why doesn't Tag2Text use the specific tags given by the user?

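When only a single image is given as input, does the caption generated by Tag2Text not use all of the tags it generates? Also, when the input to Tag2Text is an image plus several specific tags, might the generated caption still not contain those specific tags?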

CUDA out of memory when concatenating multiple images as input during inference

During inference, when I concatenate 1-2 images as input, the conversation runs normally on 8 V100-32G GPUs (device_map="auto"). When I concatenate 3 or more images (for example, 10 images), I get this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.14 GiB. GPU 0 has a total capacity of 31.75 GiB of which 8.85 GiB is free. Process 18099 has 22.89...
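
One way to reason about this report: `load_image` splits each image into up to `max_num` tiles, so the visual input grows roughly linearly with the number of concatenated images. The following is only a back-of-the-envelope sketch, not part of InternVL. The helper name `tiles_per_image` and the idea of a fixed total tile budget are hypothetical, and the budget of 12 assumes the working 2-image case used `max_num=6`.

```python
def tiles_per_image(num_images: int, total_tile_budget: int, min_tiles: int = 1) -> int:
    """Return a per-image tile cap so the total tile count stays near a budget.

    Hypothetical helper for illustration: the result would be passed as
    `max_num` when loading each image. The budget is never allowed to push
    the cap below `min_tiles`, so very large image counts can still exceed it.
    """
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return max(min_tiles, total_tile_budget // num_images)


# Assumed budget: 2 images x 6 tiles fit in memory, so allow 12 tiles in total.
BUDGET = 12
for n in (2, 3, 10):
    print(n, "images ->", tiles_per_image(n, BUDGET), "tiles each")
```

Lowering the per-image tile cap trades image detail for memory, so whether the answers stay usable at the smaller setting would need to be checked on real inputs.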

Question answering over video with internvl1.5

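I would like to use internvl1.5 for question answering over videos. What approach can I take? I have tried extracting frames from the video and concatenating the extracted images as input, but concatenating too many images greatly increases the input length, which exhausts GPU memory at inference time. Given this, is there a good way to do question answering over video input with internvl1.5?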
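
A generic way to keep the input length bounded, whatever the video duration, is to sample a fixed number of evenly spaced frames instead of every extracted frame. The sketch below only illustrates that idea. `sample_frame_indices` is a hypothetical helper, not an InternVL API, and a suitable number of frames would have to be found by experiment.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices, one from the middle of each segment.

    The video is divided into `num_samples` equal segments and the midpoint
    frame of each segment is returned. If the video has fewer frames than
    requested, every frame index is returned once.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# Example: choose 4 frames from a 100-frame clip.
print(sample_frame_indices(100, 4))
```

The selected frames can then be loaded and concatenated as in the existing multi-image setup, so the number of images fed to the model stays constant across videos.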

Model weights of the Unmasked Teacher fine-tuned on panda

Could you open-source the model weights of the Unmasked Teacher fine-tuned on panda?