
Add visual-question-answering / multimodal support to gradio notebook tasks

Open Bedrovelsen opened this issue 11 months ago • 4 comments

Enjoying the recent gradio notebook stuff!

Was curious about when/if support for an additional Hugging Face task option of "visual question answering" is planned?

If it's not currently planned, could you give a quick overview of how to add a new task category to the gradio notebook codebase? (Beyond just manually reading over the current gradio notebook code to figure it out on my own, which I can do of course, but guidance from the team is preferred for best practices in contributing, etc.)

Bedrovelsen avatar Mar 03 '24 11:03 Bedrovelsen

Thanks @Bedrovelsen! Would love your help adding that. I messaged you on Discord so our team can work with you to make sure you can get this set up!

saqadri avatar Mar 03 '24 16:03 saqadri

Sounds good

Bedrovelsen avatar Mar 03 '24 17:03 Bedrovelsen

Just copying over the quick implementation overview from Discord here:

  1. A new HuggingFaceVisualQuestionAnsweringRemoteInference ModelParser under the https://github.com/lastmile-ai/aiconfig/tree/main/extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/remote_inference_client folder. This parser should look pretty similar to the existing HuggingFaceImage2TextRemoteInference model parser, with the following changes:
  • serialize implementation will do the same image/attachment data handling, but the constructed PromptInput will also need a data string representing the 'question' value from the data passed to serialize
  • refine_completion_params implementation can be the same, but should have a comment pointing to the visual_question_answering API code: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/inference/_client.py#L1785
  • deserialize implementation can be mostly the same, except we will need to add 'question' to the completion_data from the prompt data: completion_data["question"] = prompt["data"]
  • run implementation will be similar as well; it just needs to call client.visual_question_answering with the completion_data and handle the response as desired. It looks like the response will be a list of VisualQuestionAnsweringOutputElement objects; we'll want to serialize those as ExecuteResult outputs in whatever format you think is best. For example, we could have data be the answer and store the score in metadata (see the sketch after this list)
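To make the shape of those changes concrete, here's a minimal standalone sketch of the deserialize/run flow using huggingface_hub directly. This is not the actual parser code: the ModelParser boilerplate, token handling, and attachment resolution are omitted, and the image path, question string, and output format below are placeholder assumptions to be checked against HuggingFaceImage2TextRemoteInference when implementing.

```python
# Standalone sketch of the data flow described above (not the real parser).
from aiconfig.schema import ExecuteResult
from huggingface_hub import InferenceClient

client = InferenceClient()  # the real parser resolves token/model itself

# deserialize: start from the Image2Text-style completion params, then add
# the question string taken from the prompt data.
completion_data = {
    "image": "path/or/url/to/image.png",  # resolved from the prompt attachment
    "question": "How many people are in the photo?",  # prompt["data"]
}

# run: call the VQA endpoint; the response is a list of
# VisualQuestionAnsweringOutputElement objects with .answer and .score.
responses = client.visual_question_answering(**completion_data)

# Serialize each element as an ExecuteResult: answer as data, score in
# metadata (one possible format, as suggested above).
outputs = [
    ExecuteResult(
        output_type="execute_result",
        execution_count=i,
        data=element.answer,
        metadata={"score": element.score},
    )
    for i, element in enumerate(responses)
]
```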

I believe the helpers for validating/retrieving the image from attachments can be kept the same.

With the parser implemented, we can expose it in the extension here: https://github.com/lastmile-ai/aiconfig/blob/main/extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/__init__.py
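The export should just mirror the existing remote-inference parser imports in that file; roughly something like the following, where the module name visual_question_answering is hypothetical and should match whatever you name the new parser file:

```python
# In extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/__init__.py
# (module name below is hypothetical; match your new parser's filename)
from .remote_inference_client.visual_question_answering import (
    HuggingFaceVisualQuestionAnsweringRemoteInference,
)
```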

For testing the extension, please see README instructions - https://github.com/lastmile-ai/aiconfig/blob/main/extensions/HuggingFace/python/README.md

Then, I would recommend importing and registering the new parser in https://github.com/lastmile-ai/aiconfig/blob/main/cookbooks/Gradio/aiconfig_model_registry.py with id "Visual Question Answering", and then following the Getting Started instructions in https://github.com/lastmile-ai/aiconfig/blob/main/cookbooks/Gradio/README.md to open the huggingface.aiconfig.json file with the new parser registered.
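The registration itself should look roughly like the existing entries in that registry file; something along these lines, with the import path assumed to follow the extension layout above:

```python
# Sketch of the aiconfig_model_registry.py addition; mirrors the pattern the
# other HuggingFace remote-inference parsers use in that file.
from aiconfig import AIConfigRuntime
from aiconfig_extension_hugging_face.remote_inference_client.visual_question_answering import (  # hypothetical module path
    HuggingFaceVisualQuestionAnsweringRemoteInference,
)

vqa_parser = HuggingFaceVisualQuestionAnsweringRemoteInference()
AIConfigRuntime.register_model_parser(vqa_parser, "Visual Question Answering")
```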

On the UI side, we will need to add a new PromptSchema to the client for rendering the parser's input and settings nicely. I can implement that shortly.

rholinshead avatar Mar 04 '24 21:03 rholinshead

Whoops, I linked #1396 (which has the schema changes) and it auto-closed this issue. This issue is still open.

rholinshead avatar Mar 04 '24 22:03 rholinshead