
Align internal methods with other vision models, e.g. CLIP. Improvements to the AutoProcessor

Open michaelfeil opened this issue 1 year ago • 2 comments

Currently, many things are hacked around, which makes colpali-engine frustrating to build on.

The following abstractions need to be improved:

  • tokenization via the AutoProcessor: queries cannot be processed alone and need to be padded with dummy images (see the sketch after this list)
  • the forward pass processes images and text; however, joining the results of text and images cannot be done in the same forward call. CLIP models, for example, have a separate forward pass for each modality.
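
For illustration (not part of the original issue text), this is roughly what the dummy-image workaround looks like; the checkpoint name, image size, and prompt prefix below are assumptions rather than the exact colpali-engine API:

# Illustration only: a text-only query batch padded with a throwaway image so
# that a multimodal processor accepts it. Checkpoint, image size, and prompt
# prefix are assumptions, not the exact colpali-engine API.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-448")  # assumed base processor

queries = ["Is attention really all you need?"]
dummy_image = Image.new("RGB", (448, 448), color="white")  # placeholder "page"

batch = processor(
    text=[f"Question: {q}" for q in queries],
    images=[dummy_image] * len(queries),  # every query carries a dummy image
    return_tensors="pt",
    padding="longest",
)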

Beyond that:

  • poetry pins accelerate to a specific version, even though it is only needed for training. This makes colpali-engine hard to adopt in upstream projects if, e.g., only part of the functionality (such as inference) is needed.

Also, if possible, upstream your library implementations into Hugging Face transformers. Integrating image models is currently a tooling mess and makes the open-source transformers landscape really unpleasant to work with.

This is currently blocking adoption into e.g. https://github.com/michaelfeil/infinity

michaelfeil avatar Oct 09 '24 18:10 michaelfeil

Accelerate is an optional dependency that is only installed if you need training.

Agreed that the mock image is suboptimal; note that ColQwen2, which is probably the model to use nowadays, does not do this.

I'm not sure what your point is about the separate forward passes; we have to do one for the query and one for the document, no? If you are talking about document processing, which goes through a vision encoder and then the language model, that is a sequential operation that has to be done.

OK for the versioning; currently the only out-of-date packages are numpy and peft, but we will bump them.

More generally, I am happy to make changes to improve the repo, but it would help to have clearer explanations of what you want. This lib is not designed to be fully compatible with everything; the best option is probably to use it through HF with the trust_remote_code=True flag.

ManuelFay avatar Oct 09 '24 20:10 ManuelFay

You can try with this (if peft is installed):

import torch  # needed for the bfloat16 dtype below
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "manu/colqwen2-v0.1-hf",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained("manu/colqwen2-v0.1-hf", trust_remote_code=True)
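
A rough usage sketch for the snippet above, assuming the remote-code processor exposes process_queries, process_images, and score_multi_vector the way colpali-engine's processors do; the exact method names may differ:

# Usage sketch only; assumes colpali-engine-style processor methods
# (process_queries / process_images / score_multi_vector). The remote-code
# variant may expose a different interface.
import torch
from PIL import Image

queries = ["What is the total revenue in 2023?"]
pages = [Image.new("RGB", (448, 448), color="white")]  # stand-in for real page screenshots

with torch.no_grad():
    query_batch = processor.process_queries(queries).to(model.device)
    page_batch = processor.process_images(pages).to(model.device)
    query_embeddings = model(**query_batch)  # multi-vector embeddings, one set per query
    page_embeddings = model(**page_batch)    # multi-vector embeddings, one set per page

scores = processor.score_multi_vector(query_embeddings, page_embeddings)  # late-interaction (MaxSim) scores
print(scores.shape)  # (num_queries, num_pages)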

ManuelFay avatar Oct 09 '24 22:10 ManuelFay

@ManuelFay The reason I opened this issue is that the API deviates from the Hugging Face interface, but for no good reason.

To integrate colpali, I had to refactor a class that implements forward passes. https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/transformer/vision/torch_vision.py

For example, it would be helpful to add:

image_embeds: "Tensor" = self.model.get_image_features(
    pixel_values=features.get("pixel_values"),
)

https://github.com/michaelfeil/infinity/blob/62a07c9d91b8bddb999001277563dbbde24844d4/libs/infinity_emb/infinity_emb/transformer/vision/torch_vision.py#L190C13-L209C22
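
For concreteness, here is a hypothetical sketch of what such a CLIP-style split could look like as a thin wrapper over the unified forward; none of these method names exist in colpali-engine today:

# Hypothetical sketch only: CLIP-style per-modality entry points layered on
# top of the unified forward. Not an existing colpali-engine API.
import torch


class SplitForwardWrapper:
    """Expose get_image_features / get_text_features over a single-forward model."""

    def __init__(self, model):
        self.model = model

    @torch.no_grad()
    def get_image_features(self, **image_inputs):
        # image_inputs: pixel_values plus whatever prompt tokens the model expects
        return self.model(**image_inputs)

    @torch.no_grad()
    def get_text_features(self, **text_inputs):
        # text_inputs: input_ids / attention_mask for a query-only batch
        return self.model(**text_inputs)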

michaelfeil avatar Oct 20 '24 18:10 michaelfeil

OK, I get it: you mean that it's not the same as other contrastive vision models such as CLIP, which have dedicated forward functions for images and texts. In our case, we just call forward and, depending on the arguments, the model embeds just the text, just the image, or both.

It is, however, consistent with the implementation of generative VLMs such as Qwen2-VL, Llava, Idefics3, etc. This actually makes sense because these models can take both modalities at the same time, text and images. The same model deals with all inputs, rather than two different models as you would have in CLIP.

Having a unified forward that accepts everything enables concatenating text tokens to the image tokens, for example to add metadata/instructions/text input to the document embeddings. I personally think forcing a "get_image_features" method would be restrictive in the long run, because we couldn't guarantee API consistency with CLIP indefinitely.
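
As a rough illustration of that point, reusing the model and processor loaded above and assuming the processor follows the generic HF text-plus-images calling convention (this is not a documented colpali-engine recipe):

# Sketch of the unified-forward benefit: embed a page together with extra
# textual metadata in one call. The processor call follows the generic HF
# text+images pattern and is an assumption, not a documented recipe.
import torch
from PIL import Image

page = Image.new("RGB", (448, 448), color="white")  # stand-in for a real document page
metadata = "Source: annual report 2023, page 12"

batch = processor(
    text=[metadata],   # extra text tokens concatenated with the image tokens
    images=[page],
    return_tensors="pt",
    padding="longest",
).to(model.device)

with torch.no_grad():
    page_embedding = model(**batch)  # one multi-vector embedding covering image + metadata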

I guess that's up for debate; it would be possible to patch this for now, but I don't think that's the way to go.

The plan is to merge this into Hugging Face anyway rather soon, so we will see what is best in that PR!

Thanks a ton for your time and contributions,

Manu

ManuelFay avatar Oct 20 '24 19:10 ManuelFay

Hey @ManuelFay, thanks for your input - looking forward to the PR with Hugging Face.

FYI, this now works for deployment behind a REST API.

port=7997
model1=michaelfeil/colqwen2-v0.1
volume=$PWD/data

docker run -it --gpus all \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:0.0.66 \
 v2 \
 --model-id $model1 \
 --port $port --batch-size 2

michaelfeil avatar Oct 20 '24 19:10 michaelfeil