mindnlp
The image features obtained from CLIPProcessor and CLIPScore differ significantly from the transformers results
Describe the bug (Mandatory): The features obtained after processing an image with CLIPProcessor and CLIPScore differ significantly from the results produced by transformers.
- Hardware Environment (Ascend/GPU/CPU): CPU
- Software Environment (Mandatory):
  - MindSpore version: 2.2.13
  - Python version: 3.9.19
  - OS platform and distribution: Windows 11
  - GCC/Compiler version (if compiled from source):
- Execute Mode (Mandatory) (PyNative/Graph):
To Reproduce (Mandatory) Steps to reproduce the behavior:

```python
from mindnlp.transformers import CLIPProcessor, CLIPModel
from transformers import CLIPProcessor as TorchCLIPProcessor, CLIPModel as TorchCLIPModel
import numpy as np
import mindspore as ms

np_img = np.random.randint(0, 255, (3, 224, 224))
text = "an image of dog"

processor_ms = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model_ms = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", ignore_mismatched_sizes=True)
processed_input_ms = processor_ms(text=text, images=np_img, return_tensors="np")
image_features_ms = model_ms.get_image_features(pixel_values=ms.Tensor(processed_input_ms['pixel_values']))
text_features_ms = model_ms.get_text_features(input_ids=ms.Tensor(processed_input_ms['input_ids']), attention_mask=ms.Tensor(processed_input_ms['attention_mask']))

processor_torch = TorchCLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model_torch = TorchCLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processed_input_torch = processor_torch(text=text, images=np_img, return_tensors="pt")
image_features_torch = model_torch.get_image_features(pixel_values=processed_input_torch['pixel_values'])
text_features_torch = model_torch.get_text_features(input_ids=processed_input_torch['input_ids'], attention_mask=processed_input_torch['attention_mask'])

print("image_features_ms", image_features_ms[0][0:5])
print("image_features_torch", image_features_torch[0][0:5])
print("text_features_ms", text_features_ms[0][0:5])
print("text_features_torch", text_features_torch[0][0:5])
```
Expected behavior (Mandatory): The image and text features produced by mindnlp should closely match those produced by transformers.
Screenshots / Logs (Mandatory)

```text
image_features_ms    [ 0.19177139  1.2455583   0.85580724 -1.4593458  -1.8108404 ]
image_features_torch tensor([-0.0462,  0.7193,  0.8034,  0.6526, -0.0212], grad_fn=<SliceBackward0>)
text_features_ms     [-1.6040958  1.297238  -0.8092618  1.6023345 -2.6258645]
text_features_torch  tensor([-0.0300,  0.1251,  0.4510, -0.1160,  0.4756], grad_fn=<SliceBackward0>)
```
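One way to narrow down whether the gap originates in the image preprocessing or in the loaded model weights (a hypothetical check, not part of the original report) is to compare the processor outputs before any model forward pass. The sketch below assumes the `processed_input_ms` and `processed_input_torch` dictionaries from the reproduction script:

```python
import numpy as np

# pixel_values from the MindSpore-side processor are already NumPy (return_tensors="np");
# the PyTorch-side ones are a torch.Tensor.
pv_ms = processed_input_ms["pixel_values"]
pv_torch = processed_input_torch["pixel_values"].numpy()

print("pixel_values shapes:", pv_ms.shape, pv_torch.shape)
print("pixel_values max abs diff:", np.abs(pv_ms - pv_torch).max())

# The tokenized text should be bit-identical between the two processors.
print("input_ids equal:",
      np.array_equal(processed_input_ms["input_ids"],
                     processed_input_torch["input_ids"].numpy()))
```

If the pixel_values already differ here, the gap comes from the image preprocessing; if they match, it comes from the model weights or forward pass (note that the MindSpore model is loaded with ignore_mismatched_sizes=True, which re-initializes any parameters whose shapes do not match the checkpoint).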
Additional context (Optional): Add any other context about the problem here.