Tools not invoked when using Qwen2VL weights with OpenThinkIMG
Hello, when I test OpenThinkIMG with Qwen2VL's weights on the OpenThinkIMG-Chart-Test-994 dataset, the tools are never invoked and the model outputs the final answer directly. How can I solve this problem?
I ran into the same problem. I also tried the checkpoint provided on Hugging Face, and the behavior seems to be the same: the answer comes back directly without any tool calls. Have you managed to solve this?
I added the system prompt used for RL training to the inference-phase code, and with it the model now seems to call the tools as expected:
"""You are a visual assistant capable of generating and solving steps for chart-based reasoning. Your goal is to answer chart-related questions. You can rely on your own capabilities or use external tools to assist in solving. Here are the available actions:
- **OCR**: Extracts text from an image. Example: `{"name": "OCR", "arguments": {"image": "img_1"}}`
- **Point**: Identifies a point in the image based on description and returns coordinates. Example: `{"name": "Point", "arguments": {"image": "img_1", "param": "x-axis value 1970"}}`
- **ZoomInSubfigure**: Crops the image to the specified subfigure. Example: `{"name": "ZoomInSubfigure", "arguments": {"image": "img_1", "param": "Downstream vs. Concept: Toy"}}`
- **SegmentRegionAroundPoint**: Segments a region around a given point. Example: `{"name": "SegmentRegionAroundPoint", "arguments": {"image": "img_1", "param": "x=\"21.5\" y=\"28.5\""}}`
- **DrawHorizontalLineByY**: Draws a horizontal line at a given y-coordinate. Example: `{"name": "DrawHorizontalLineByY", "arguments": {"image": "img_1", "param": "y=28.5"}}`
- **DrawVerticalLineByX**: Draws a vertical line at a given x-coordinate. Example: `{"name": "DrawVerticalLineByX", "arguments": {"image": "img_1", "param": "x=21.5"}}`
- **Terminate**: Ends the task and provides the final answer. Example: `{"name": "Terminate", "arguments": {"ans": "1985"}}`
To solve the problem:
1. Select actions from the provided tools list, combining them logically and building on previous steps. Call one action at a time, using its output for the next.
2. To use `SegmentRegionAroundPoint`, `DrawHorizontalLineByY`, or `DrawVerticalLineByX`, first call "Point" to get coordinates for further actions.
Your output should be in a strict JSON format as follows:
{"thought": "the reasoning process", "actions": [{"name": "action", "arguments": {"argument1": "value1", "argument2": "value2"}}]}
"""
Hi, I tried adding the same system_prompt to the inference phase, but I encountered an error. Could you please share how you added the system prompt in your code? Thanks a lot!
Here’s the error log I got:
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:     inputs = self.form_input_from_dynamic_batch(batch)
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:   File "/root/work/filestorage/gaoshan/projects/OpenThinkIMG/tool_server/tf_eval/models/qwen2vl.py", line 155, in form_input_from_dynamic_batch
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:     image_inputs, _ = process_vision_info(messages)
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:   File "/root/work/filestorage/gaoshan/conda_envs/qwen2_5vl/lib/python3.10/site-packages/qwen_vl_utils/vision_process.py", line 364, in process_vision_info
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:     image_inputs.append(fetch_image(vision_info))
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:   File "/root/work/filestorage/gaoshan/conda_envs/qwen2_5vl/lib/python3.10/site-packages/qwen_vl_utils/vision_process.py", line 116, in fetch_image
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:     image_obj = Image.open(image)
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:   File "/root/work/filestorage/gaoshan/conda_envs/qwen2_5vl/lib/python3.10/site-packages/PIL/Image.py", line 3465, in open
2025-11-05 16:49:11 | ERROR | stderr | [rank2]:     fp = builtins.open(filename, "rb")
2025-11-05 16:49:11 | ERROR | stderr | [rank2]: OSError: [Errno 36] File name too long: '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAFxAXADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1
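From the traceback, my guess is that the `image` field in my messages holds a raw base64 string, which `fetch_image` then hands to `Image.open` as if it were a file path (hence the Errno 36). I'm experimenting with the workaround below. This is a sketch under the assumption that `qwen_vl_utils` routes string images starting with `data:image` through its base64 branch; `to_image_entry` and `to_pil` are helper names I made up:

```python
import base64
from io import BytesIO

from PIL import Image

def to_image_entry(b64_payload: str) -> dict:
    # fetch_image() treats bare strings as paths/URLs; a "data:image" prefix
    # should route it through the base64 branch instead of builtins.open().
    return {"type": "image", "image": f"data:image/jpeg;base64,{b64_payload}"}

def to_pil(b64_payload: str) -> Image.Image:
    # Alternative: decode to a PIL image up front, which fetch_image()
    # also accepts directly.
    return Image.open(BytesIO(base64.b64decode(b64_payload))).convert("RGB")
```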