[Feature Request] Explore native image handling method

Open • Wendong-Fan opened this issue 5 months ago • 1 comment

Required prerequisites

[x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[ ] Consider asking first in a Discussion.

Motivation

Our primary method for handling image-related tasks is to have an agent execute a tool call. This call delegates the image analysis to a secondary agent or a vision-capable Large Language Model (LLM), which then returns the relevant information.
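For reference, the current delegation flow can be sketched roughly as follows. This is a minimal, self-contained illustration and not the project's actual code: `fake_vision_model`, `make_image_tool`, and `primary_agent_step` are hypothetical names, and the vision call is stubbed out so the snippet runs without any model.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a vision-capable model or secondary agent.
# A real setup would send the image to an actual vision LLM here.
def fake_vision_model(image_path: str, question: str) -> str:
    return f"[description of {image_path} answering: {question}]"


def make_image_tool(
    vision_model: Callable[[str, str], str],
) -> Callable[[str, str], str]:
    """Wrap a vision model as a tool that the primary agent can call."""
    def analyze_image(image_path: str, question: str) -> str:
        return vision_model(image_path, question)
    return analyze_image


def primary_agent_step(
    user_text: str,
    image_path: str,
    tools: Dict[str, Callable[[str, str], str]],
) -> str:
    # The primary agent never sees the image itself; it only receives
    # the text that the tool call returns.
    tool_result = tools["analyze_image"](image_path, user_text)
    return f"Answer based on tool output: {tool_result}"


tools = {"analyze_image": make_image_tool(fake_vision_model)}
reply = primary_agent_step("What is in the picture?", "cat.png", tools)
```

The extra hop through the tool is what the ideas below try to avoid.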

Optional Solutions for Future Exploration:

An alternative approach would be to send the user's message directly to ChatAgent.step. If an image is detected, a built-in function would be triggered to handle the image input, streamlining the process.
To ensure the conversational message history remains valid, we could programmatically add a placeholder (or "dummy") message to the tool's output.
Clone the registered agent, including the original agent's memory, inside ChatAgent.step, let the cloned agent step on the image, and return its response.
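The ideas above could be prototyped along these lines. This is only a sketch under stated assumptions: `SketchAgent`, `Message`, `vision_fn`, and `_step_image` are hypothetical stand-ins written for illustration and are not CAMEL's `ChatAgent` API. The placeholder text and the choice to store a text-only copy of the user message are also assumptions.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Message:
    role: str                      # "user", "assistant", or "tool"
    content: str
    image: Optional[bytes] = None  # raw image bytes, if any


@dataclass
class SketchAgent:
    """Hypothetical stand-in for an agent; NOT CAMEL's ChatAgent."""
    vision_fn: Callable[[bytes, str], str]  # hypothetical vision backend
    memory: List[Message] = field(default_factory=list)

    def clone(self) -> "SketchAgent":
        # Deep-copy the memory so the clone cannot mutate the original history.
        return SketchAgent(self.vision_fn, copy.deepcopy(self.memory))

    def step(self, msg: Message) -> Message:
        # Built-in branch: an image in the message triggers native handling.
        if msg.image is not None:
            return self._step_image(msg)
        self.memory.append(msg)
        reply = Message("assistant", f"(text reply to: {msg.content})")
        self.memory.append(reply)
        return reply

    def _step_image(self, msg: Message) -> Message:
        # A clone carrying the original agent's memory handles the image.
        worker = self.clone()
        description = worker.vision_fn(msg.image, msg.content)
        # Record text-only entries, including a placeholder ("dummy") tool
        # message, so the stored history stays valid without image bytes.
        self.memory.append(Message("user", msg.content))
        self.memory.append(Message("tool", "[placeholder: image handled natively]"))
        reply = Message("assistant", description)
        self.memory.append(reply)
        return reply


agent = SketchAgent(vision_fn=lambda img, q: f"{len(img)}-byte image; asked: {q}")
agent.step(Message("user", "hello"))
reply = agent.step(Message("user", "What is this?", image=b"\x89PNG"))
```

In this sketch the caller only ever calls `step`; whether the real implementation should keep the image in memory or replace it with a placeholder is an open design question for this issue.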

Solution

No response

Alternatives

No response

Additional context

No response

Jul 22 '25 14:07 Wendong-Fan

@Wendong-Fan Hi, I'd like to take this one. Do you have any suggestions on where I should start?

Aug 10 '25 11:08 Jack-mi