Multimodal support?
Great work! I would like to know whether this framework supports multimodal input from agents. For example, could it handle agents that return both images and text (perhaps similar to OpenAI o3)?
We have people working on an example. There is currently no blocking issue.
Thanks! By the way, I'd like to ask whether this example integrates the image tokens into the rollout trajectory, so that the base model can use these image tokens when generating its reasoning for the next step, i.e., think with images.
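To make the question concrete, here is a minimal sketch (in Python) of what an interleaved multimodal rollout trajectory could look like. This is not the framework's actual API; every class and field name here is hypothetical. The idea is that image observations returned by the environment are appended as placeholder tokens plus pixel features, excluded from the loss, but visible to the policy at the next generation step.

```python
# Hypothetical sketch of a rollout trajectory that interleaves text and image segments.
# None of these names come from the framework; they only illustrate the question above.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One contiguous span of the trajectory: either text tokens or an image observation."""
    token_ids: List[int]                 # text token ids, or image placeholder ids
    pixel_values: Optional[list] = None  # raw image features for the vision encoder, if any
    is_model_output: bool = False        # True for tokens the policy generated (used for loss masking)


@dataclass
class Trajectory:
    segments: List[Segment] = field(default_factory=list)

    def add_text(self, token_ids: List[int], generated: bool) -> None:
        self.segments.append(Segment(token_ids, is_model_output=generated))

    def add_image(self, placeholder_ids: List[int], pixel_values: list) -> None:
        # Environment/tool observations (e.g. a screenshot) are appended as image
        # placeholder tokens plus pixel features; they are not trained on, but the
        # next generation step can attend to them.
        self.segments.append(Segment(placeholder_ids, pixel_values=pixel_values))

    def input_ids(self) -> List[int]:
        return [t for seg in self.segments for t in seg.token_ids]

    def loss_mask(self) -> List[int]:
        return [1 if seg.is_model_output else 0 for seg in self.segments for _ in seg.token_ids]


# Example multi-turn ReAct-style rollout: prompt -> reasoning -> image observation -> next reasoning
traj = Trajectory()
traj.add_text([101, 102, 103], generated=False)                    # user prompt tokens
traj.add_text([201, 202], generated=True)                          # model's first reasoning/action
traj.add_image([32000, 32000, 32000], pixel_values=[[0.1, 0.2]])   # returned image observation
traj.add_text([301, 302, 303], generated=True)                     # next step: model "thinks with" the image
print(traj.input_ids(), traj.loss_mask())
```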
Really looking forward to multimodal support.
Excuse me, I'd like to know the timeline for supporting multimodal models such as Qwen3-VL, as I am looking for a framework to complete a project involving multimodal ReAct. Do you plan to support them in the near future?
Marking this thread to follow.
Looking forward to this feature as well.