Pydantic vision module to build off of and enable multimodal capabilities for all providers.
Hi all,
This library is amazing! I added something similar to the Pydantic Image class we use internally to abstract away all the messy conversions you otherwise need to do.
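A minimal sketch of the idea (illustrative field and method names, not necessarily the exact vision.py interface):

```python
# Minimal sketch only: wraps an image loaded from disk and normalizes it to a
# base64 data URI, which is what most chat APIs accept for inline images.
import base64
from pathlib import Path

from pydantic import BaseModel


class Image(BaseModel):
    data: str                      # base64-encoded image payload
    media_type: str = "image/png"

    @classmethod
    def from_file(cls, path: str) -> "Image":
        raw = Path(path).read_bytes()
        return cls(data=base64.b64encode(raw).decode("utf-8"))

    def to_data_uri(self) -> str:
        return f"data:{self.media_type};base64,{self.data}"
```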
Question: why Poetry? I'm team Hatch all the way and I'm not sure what to do with the poetry.lock file.
The problem with multimodality, however, is that joblib won't cache correctly unless we encode to PNG (which is easy to do with the vision.py module, but a better approach might be "close enough" similarity, maybe via CLIP embeddings?).
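Rough sketch of the workaround (hypothetical helper names, untested here): joblib builds cache keys by hashing arguments, and an in-memory PIL image doesn't hash deterministically, so serializing to PNG bytes first gives a stable key.

```python
import io

from joblib import Memory
from PIL import Image as PILImage

memory = Memory(".joblib_cache", verbose=0)


def to_png_bytes(img: PILImage.Image) -> bytes:
    # Deterministic serialization so joblib can hash the argument reliably.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


@memory.cache
def answer_question(png_bytes: bytes, question: str) -> str:
    ...  # call the vision model here; results cached on (png_bytes, question)
```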
If you all like, I can integrate more of our backend API, such as a Message class that has to_openai(), to_anthropic(), etc. methods to abstract away all of those API specifics and allow for interwoven image and text inputs.
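For example (sketch, using OpenAI's chat vision content format), to_openai() would render interleaved text and image parts as something like:

```python
# Illustrative output of a hypothetical to_openai() call: one message whose
# content interleaves text blocks and base64 image parts.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is happening in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}
```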
Cheers!
After testing against the main branch, this also fixes some circular import bugs that aren't currently covered by tests.
Really interesting to see vision here. It would be very exciting to see it used in context, for example with an evaluation function and an example.
You need only wait until the end of the day; the application is in-context learning for robotics :)
This duplicates PR #682. I have no personal preference for either one, and it seems like yours is designed with more modalities in mind.
Tagging @KCaverly @CyrusOfEden to see if we should delay this until the backend refactor.
I believe the LiteLLM backend should interface with GPT-4V, but I imagine we will need to accommodate this with a distinct backend, so more work would be required. We should be able to leverage this work to speed that up, though.
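Something like this should work once it's wired up (a sketch, assuming LiteLLM's OpenAI-compatible completion interface; the model name and message shape may differ):

```python
import litellm

# Sketch only: vision-capable models accept the same interleaved content format
# as the OpenAI chat API, so the backend mostly needs to pass it through.
response = litellm.completion(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```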
Added examples/multimodal/visual_question_answering.ipynb and tested the API calls.
@isaacbmiller Could you provide more details about the plans for the backend refactor?
To get multimodal Claude and HF support, I wanted to add the following Pydantic models with to_* methods (to_openai(), to_anthropic(), etc.):

Message
    role: str
    content: list[Prompt]

Prompt
    text: str
    image: Image

We do this internally, but there are probably better design approaches.
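Roughly, a runnable sketch might look like this (field names and conversion details are still up for discussion; I've made the image optional here so text-only prompts work):

```python
from typing import Optional

from pydantic import BaseModel


class Image(BaseModel):
    data: str                      # base64-encoded payload
    media_type: str = "image/png"


class Prompt(BaseModel):
    text: str
    image: Optional[Image] = None  # optional so text-only prompts are valid


class Message(BaseModel):
    role: str
    content: list[Prompt]

    def to_openai(self) -> dict:
        # Flatten each Prompt into OpenAI-style text / image_url parts.
        parts: list[dict] = []
        for p in self.content:
            parts.append({"type": "text", "text": p.text})
            if p.image is not None:
                parts.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:{p.image.media_type};base64,{p.image.data}"},
                })
        return {"role": self.role, "content": parts}
```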
@sebbyjp @arnavsinghvi11 I'm eager to test this. Is it possible to compile datasets with image inputs at this stage, to perform prompt optimisation?