[Go] How can a tool return multiple messages to the model? For example, when the model calls a search tool to search for scene images.
I have examined the implementation. Currently a tool returns a structure that is serialized to a string and then sent back to the model.
@apascal07 Hello, I'm currently working around this by modifying the request through middleware, but it feels a bit hacky. Is there a better way?
Hi Eric, can you give a bit more detail here as to what you're trying to do?
@apascal07 The project I'm working on generates PowerPoint presentations (PPTs) with AI. One step is that the LLM uses an image search tool to find images and then judges whether each image is relevant to the PPT topic; if it is, the image is used in the presentation. I'd like the image search tool to return the images directly to the model, because currently any tool's return value is serialized into a string rather than something like a user message, so the LLM cannot read the images inside that string.
That's an interesting use case; you're right that there isn't direct support for this. The middleware approach you're using is one way. Another is to define your search tool as an interrupt and then handle the interrupt manually: inject a message telling the model to see the image below, and insert the image as a separate message/part, as sketched below.
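For anyone landing here, a rough sketch of the message-injection half of that workaround (leaving out the interrupt plumbing). This assumes a Genkit instance `g` is already initialized; the model name, prompt text, and `judgeFoundImage` helper are placeholders, and exact helper signatures may differ across Go SDK versions:

```go
package main

import (
	"context"

	"github.com/firebase/genkit/go/ai"
	"github.com/firebase/genkit/go/genkit"
)

// After the search tool has resolved (e.g. via an interrupt) and handed
// back an image URL as plain text, feed the image to the model as a
// separate user message with a media part, since tool outputs are
// currently serialized to strings.
func judgeFoundImage(ctx context.Context, g *genkit.Genkit, imageURL string) (string, error) {
	msgs := []*ai.Message{
		ai.NewUserMessage(ai.NewTextPart(
			"The search tool found the image below; judge whether it fits the slide topic.")),
		// The image rides along as a media part of its own user message.
		ai.NewUserMessage(ai.NewMediaPart("image/png", imageURL)),
	}
	resp, err := genkit.Generate(ctx, g,
		ai.WithModelName("googleai/gemini-2.5-flash"), // placeholder model
		ai.WithMessages(msgs...),
	)
	if err != nil {
		return "", err
	}
	return resp.Text(), nil
}
```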
@apascal07 May I ask if there are any plans to support multimodal tools? The scenarios I have in mind include:
- Browser tools: in front-end development, the model needs to see a screenshot of the page it just rendered.
- Search tools: in creative scenarios, the images, videos, etc. that were found need to be shown to the model.
- Computer use: in general agent development, the model operates a computer and often needs to view the UI in the current window.
- Mobile use: similar situations arise when the model operates a mobile phone.
- App operation: when the model helps users operate an app, it also needs to view the user interface.
@apascal07 Another issue is MCP compatibility. Content returned by MCP tools can already include results such as images and videos. If the framework leaves it to users to figure out how to handle multimodal content returned by tools, that's a significant gap, since this application scenario is very broad.
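For context, this is roughly the shape of an MCP tool result carrying an image, mirrored here as Go types (field names follow the MCP spec's `CallToolResult`; treat the details as approximate). A string-serialized tool return has no natural slot for the image item:

```go
// Approximate shape of an MCP CallToolResult with mixed content, per
// the MCP spec. Genkit currently serializes a tool's whole return value
// to one string, so image items like this have nowhere to go.
type mcpContent struct {
	Type     string `json:"type"`               // "text" or "image"
	Text     string `json:"text,omitempty"`     // for type == "text"
	Data     string `json:"data,omitempty"`     // base64 bytes, for type == "image"
	MIMEType string `json:"mimeType,omitempty"` // e.g. "image/png"
}

type mcpCallToolResult struct {
	Content []mcpContent `json:"content"`
}
```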
We filed this issue: https://github.com/firebase/genkit/issues/3692
It will have to be supported at the plugin level. Gemini can support this but not all providers can, so we will start there. We will need to figure out a way to simulate it for those that do not.
There's no timeline for this work yet but it should get prioritized soon.
@apascal07 Great! I have an idea: as a first step, support tools returning []*ai.Message. If the return value is []*ai.Message, append it directly to the message array without any further processing, as sketched below.
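To make that concrete, a tool under the proposed scheme might look like this. Nothing here is implemented: the []*ai.Message return type, the append-verbatim behavior, and `lookUpImage` are all hypothetical.

```go
// Hypothetical: a tool whose return value is []*ai.Message. Under the
// proposal, the framework would append these messages to the history
// verbatim instead of serializing them to a string.
searchImages := genkit.DefineTool(g, "searchImages",
	"Searches for images matching the query and returns them for inspection.",
	func(ctx *ai.ToolContext, query string) ([]*ai.Message, error) {
		url := lookUpImage(query) // placeholder search backend
		return []*ai.Message{
			ai.NewUserMessage(
				ai.NewTextPart("Search result for: "+query),
				ai.NewMediaPart("image/jpeg", url),
			),
		}, nil
	})
```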
@apascal07 Hello, may I ask if there has been any progress?
@hugoaguirre @apascal07 Is the Go version of Genkit still being maintained? There hasn't been a release or a bug fix for quite a while.
Much of the team has been out of office for the last few weeks. We'll take a look at this feature shortly.