[Feature Request] Explore native image handling method
Required prerequisites
- [x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Motivation
Currently, our primary method for handling image-related tasks is to have the agent execute a tool call. This call delegates the image analysis to a secondary agent or a vision-capable large language model (LLM), which then returns the relevant information.
Optional solutions for future exploration:
- Send the user's message directly to ChatAgent.step. If an image is detected, a built-in function is triggered to handle the image input, which removes the separate tool call and streamlines the process.
- To ensure the conversational message history remains valid, programmatically add a placeholder (or "dummy") message as the tool's output.
- Inside ChatAgent.step, clone the registered agent together with the original agent's memory, let the cloned agent step on the image, and return its response.
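
To make the three ideas above more concrete, here is a rough sketch of how they might fit together. Everything in it is hypothetical: `SketchAgent`, `Message`, `describe_image`, and `_step_with_image` are stand-ins invented for illustration and are not the real ChatAgent API. The vision model is replaced by a plain callable so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Message:
    role: str                      # "user", "assistant", or "tool"
    content: str
    image: Optional[bytes] = None  # raw image data, if the message carries any


@dataclass
class SketchAgent:
    """Hypothetical stand-in for an agent; NOT the real ChatAgent."""
    describe_image: Callable[[bytes], str]  # assumed hook for a vision-capable model
    memory: List[Message] = field(default_factory=list)

    def clone(self) -> "SketchAgent":
        # Copy the memory list so the clone sees the conversation so far
        # without being able to modify the original agent's history.
        return SketchAgent(self.describe_image, list(self.memory))

    def step(self, msg: Message) -> Message:
        # Idea 1: detect the image inside step() instead of via a tool call.
        if msg.image is not None:
            return self._step_with_image(msg)
        self.memory.append(msg)
        reply = Message("assistant", f"echo: {msg.content}")
        self.memory.append(reply)
        return reply

    def _step_with_image(self, msg: Message) -> Message:
        # Idea 3: a clone with the original memory handles the image.
        helper = self.clone()
        helper.memory.append(msg)
        description = helper.describe_image(msg.image)
        # Idea 2: record a text-only placeholder as the tool output so the
        # original history stays valid and carries no raw image data.
        self.memory.append(Message("user", msg.content))
        self.memory.append(Message("tool", f"[image handled] {description}"))
        reply = Message("assistant", description)
        self.memory.append(reply)
        return reply


# Usage with a fake vision model that only reports the payload size.
agent = SketchAgent(describe_image=lambda data: f"{len(data)} bytes")
reply = agent.step(Message("user", "What is in this picture?", image=b"abc"))
```

In this sketch the original agent's memory ends up with a user message, a placeholder tool message, and the assistant reply, none of which hold the image itself. Whether the real implementation should store the image in memory is an open design question.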
Solution
No response
Alternatives
No response
Additional context
No response
@Wendong-Fan Hi, I'd like to take this one. Do you have any suggestions on where I should start?