
Need to rework multimodal pipeline for Ollama, maybe for other APIs as well

Open andyccliao opened this issue 1 year ago • 3 comments

In reworking Ollama support for LLaVA, I found the existing pipeline for multimodal chats to be unnecessary. Ollama only requires the image to be attached to the message being sent; the multimodal model takes care of the rest.
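To illustrate what "attached to the message" means, here is a minimal sketch of a request body for Ollama's `/api/chat` endpoint, where a message can carry an `images` array of base64 strings. The helper name `buildVisionRequest` and the type names are hypothetical and not taken from Amica's code; only the `model`, `messages`, `stream`, `role`, `content`, and `images` fields come from the Ollama API.

```typescript
// Shape of a chat message as accepted by Ollama's /api/chat endpoint.
interface OllamaMessage {
  role: "system" | "user" | "assistant";
  content: string;
  images?: string[]; // base64-encoded image data, without a "data:" URL prefix
}

interface OllamaChatRequest {
  model: string;
  messages: OllamaMessage[];
  stream: boolean;
}

// Hypothetical helper (not part of Amica): append a user turn that carries
// both the text and the image, leaving the earlier history untouched.
function buildVisionRequest(
  model: string,
  history: OllamaMessage[],
  text: string,
  imageBase64: string,
): OllamaChatRequest {
  // Webcam captures are often data URLs; keep only the base64 payload.
  const raw = imageBase64.replace(/^data:image\/[a-zA-Z0-9.+-]+;base64,/, "");
  return {
    model,
    messages: [...history, { role: "user", content: text, images: [raw] }],
    stream: false,
  };
}
```

The resulting object would be sent as the JSON body of a single POST request, so no separate image-description step is involved.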

Currently, the vision chat pipeline seems to be designed for a separate vision/CLIP model and LLM: one describes the picture, and the returned description is then fed into the LLM.

andyccliao avatar Jan 07 '24 01:01 andyccliao

Any thoughts on what to do for the Vision prompt? It seems to be unnecessary.

I was thinking the default behavior will be to use the normal System prompt, and the Vision prompt could be hidden behind a checkbox, i.e. turn it into a custom Vision prompt.

andyccliao avatar Jan 19 '24 23:01 andyccliao

Also, was the desired UX for vision to be:

  1. Press the "Take Picture" button to append the picture to the next message, then send the message to get a response.
  2. Press the "Take Picture" button to send the picture and get a response, without sending any text to accompany the picture. (Still sending the rest of the chat history, just no new text.)
  3. Type something in the chatbox, then press the "Take Picture" button to send both the picture and message at the same time.

The way the UX works right now is closest to option 2, but it appends the response from the vision model to the previous chat message. (By the way, the text in the chatbox gets completely discarded.)

I think the best way to make Amica be as similar to chatting as possible is to allow both 2 and 1. My chat habits are often to send an image and then type something quickly, send an image alone, or append images to my message before sending.
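As a rough sketch of how allowing both 1 and 2 could look on the sending side, assuming pictures are queued until the next send: the function below builds one outgoing user message from the chatbox text and any pending pictures. `composeUserMessage` and `OutgoingMessage` are hypothetical names for illustration, not existing Amica code.

```typescript
interface OutgoingMessage {
  role: "user";
  content: string;
  images?: string[]; // base64-encoded pictures queued via "Take Picture"
}

// Hypothetical helper: returns null when there is nothing to send.
function composeUserMessage(
  text: string,
  pendingImages: string[],
): OutgoingMessage | null {
  const content = text.trim();
  if (content === "" && pendingImages.length === 0) {
    return null; // empty chatbox and no picture: do not send
  }
  const message: OutgoingMessage = { role: "user", content };
  if (pendingImages.length > 0) {
    // Option 1: text plus picture(s). Option 2: picture(s) with empty text.
    message.images = [...pendingImages];
  }
  return message;
}
```

With this shape the chatbox text is never discarded, and a picture sent alone still becomes its own user turn in the history.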

When I was trying to rewrite the vision pipeline, I ran into trouble deciding how it should be implemented, and I realized that it would depend on the UX. So, any opinions on this matter?

andyccliao avatar Jan 23 '24 01:01 andyccliao

Arbius has a $200 AIUS bounty for this issue!

Brief: Complete the desired vision UX so that a user can take a picture, add text, and get a response. Rework the pipeline as outlined.

Please read carefully:

To begin work on a bounty, reply by saying “I claim this bounty” - you will have 48 hours to submit your PR before someone else may attempt to claim this bounty.

To complete the bounty, within 48 hours of claiming, reply with a link to your PR referencing this issue and an Ethereum address. You must comply with reviewers' comments and have the PR merged to receive the bounty reward. Please focus on quality submissions to minimize the amount of time reviewers must spend.

slowsynapse avatar Feb 18 '24 21:02 slowsynapse