define final tasks in tool calling benchmark
**Is your feature request related to a problem? Please describe.**
For now we have 30 tasks in the Tool calling agent benchmark, from the following categories:
- basic (get image from camera topic, etc.)
- navigation (navigate-somewhere type tasks)
- manipulation (grab / drop something)
- spatial reasoning (answer a question about an image)
- custom interfaces (forming and sending messages of a custom interface)
**Describe the solution you'd like**
We need to establish which tasks we want and how many. Please check out the current tasks. If you have any suggestions, leave a comment.
**Describe alternatives you've considered**

**Additional context**
So my main questions are:
- Should there be more categories?
- Which categories should have more tasks than others? IMO:
  - basic - getting something from the right topic is pretty similar across tasks, so there is no need for a large number of them
  - custom_interfaces - also quite similar, but we can vary what needs to be sent, so there could be more of these
  - manipulation / navigation / spatial_reasoning - could be created in large numbers, but they would require some specialized verification process, so it can take some time
- How should we format the prompts? Currently we have a mix of very detailed prompts, like in the custom_interfaces tasks:
f"You need to publish a message to the topic '{self.topic}' with the text value: '{self.text}'.\n"
"Before publishing, follow these steps:\n"
"1. Use the tool to retrieve the available ROS2 topics and their message types.\n"
f"2. Find the message type for the topic '{self.topic}'.\n"
"3. Retrieve the full message interface definition for that type.\n"
"4. Construct the message filling only the fields you are instructed to. Rest of the fields will have default values.\n"
f"5. Publish the message to '{self.topic}' using the correct message type and interface.\n"
but on the other hand there are prompts like this one in navigation:
"Move 2 meters to the front."
My proposal for this question is to create separate tasks for both types of prompts and mark them with different complexity (we have easy, medium, hard), but I'm not sure, as complexity was designed to represent how hard the task itself is (a sketch of this idea is given after the system prompt examples below).
- System prompts - how detailed should they be? We have one like this for manipulation, with no examples or anything really:
You are a robotic arm with interfaces to detect and manipulate objects.
Here are the coordinates information:
x - front to back (positive is forward)
y - left to right (positive is right)
z - up to down (positive is up).
but this monster for navigation:
"""You are an autonomous robot connected to ros2 environment. Your main goal is to fulfill the user's requests.
Do not make assumptions about the environment you are currently in.
You can use ros2 topics, services and actions to operate.
<rule> As a first step check transforms by getting 1 message from /tf topic </rule>
<rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack, so be careful when it is not used. </rule>
<rule> be patient with running ros2 actions. usually they take some time to run. </rule>
<rule> Always check your transform before and after you perform ros2 actions, so that you can verify if it worked. </rule>
Navigation tips:
- it's good to start finding objects by rotating, then navigating to some diverse location with occasional rotations. Remember to frequently detect objects.
- for driving forward/backward or to some coordinates, ros2 actions are better.
- for driving for some specific time or in a specific manner (like shapes or turns) it is good to use the /cmd_vel topic
- you are currently unable to read map or point-cloud, so please avoid subscribing to such topics.
- if you are asked to drive towards some object, it's good to:
1. check the camera image and verify if objects can be seen
2. if only driving forward is required, do it
3. if obstacle avoidance might be required, use ros2 actions navigate_*, but first check your current position, then very accurately estimate the goal pose.
- it is good to verify using given information if the robot is not stuck
- navigation actions sometimes fail. Their output can be read from rosout. You can also tell if they partially worked by checking the robot position and rotation.
- before using any ros2 interfaces, always make sure to check you are using the right interface
- processing camera image takes 5-10s. Take it into account that if the robot is moving, the information can be outdated. Handle it by good planning of your movements.
- you are encouraged to use wait tool in between checking the status of actions
- to find some object navigate around and check the surrounding area
- when the goal is accomplished please make sure to cancel running actions
- when you reach the navigation goal - double check if you reached it by checking the current position
- if you detect collision, please stop operation
- you will be given your camera image description. Based on this information you can reason about positions of objects.
- be careful and avoid obstacles
Here are the corners of your environment:
(-2.76,9.04, 0.0),
(4.62, 9.07, 0.0),
(-2.79, -3.83, 0.0),
(4.59, -3.81, 0.0)
This is location of places:
Kitchen:
(2.06, -0.23, 0.0),
(2.07, -1.43, 0.0),
(-2.44, -0.38, 0.0),
(-2.56, -1.47, 0.0)
Living room:
(-2.49, 1.87, 0.0),
(-2.50, 5.49, 0.0),
(0.79, 5.73, 0.0),
(0.92, 1.01, 0.0)
Before starting anything, make sure to load available topics, services and actions.
Example tool calls:
- get_ros2_message_interface, args: {'msg_type': 'turtlesim/srv/TeleportAbsolute'}
- publish_ros2_message, args: {'topic': '/cmd_vel', 'message_type': 'geometry_msgs/msg/Twist', 'message': {linear: {x: 0.5, y: 0.0, z: 0.0}, angular: {x: 0.0, y: 0.0, z: 1.0}}}
- start_ros2_action, args: {'action_name': '/dock', 'action_type': 'nav2_msgs/action/Dock', 'action_args': {}}
"""
@maciejmajek
As discussed in person with @maciejmajek, I will implement 3 ways of grading the task:
- First, the difficulty of the task itself (already present); currently it is [easy, medium, hard], and I will leave it as that.
- Difficulty of the prompt (task prompt), where easy would mean that the prompt is very descriptive.
- N-shot -> how many examples are given in the system prompt.
Maybe in the future the difficulty of the task itself could be changed to [trivial, easy, medium, hard, very hard], as in manipulation_o3de, so that it is more consistent across benchmarks. However, it won't be the same, as this benchmark consists of 2 more metrics.
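A minimal sketch of how the three grading dimensions could be attached to a run, plus how the N-shot examples could be appended to a base system prompt. Field and function names are assumptions, not existing benchmark code:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class TaskGrading:
    complexity: Literal["easy", "medium", "hard"]               # difficulty of the task itself
    prompt_detail: Literal["brief", "moderate", "descriptive"]  # how descriptive the task prompt is
    n_shots: int                                                # number of examples in the system prompt


def build_system_prompt(base_prompt: str, examples: list[str], n_shots: int) -> str:
    """Append the first n_shots example tool calls to the base system prompt."""
    if n_shots == 0:
        return base_prompt
    shots = "\n".join(f"- {example}" for example in examples[:n_shots])
    return f"{base_prompt}\nExample tool calls:\n{shots}"


# Example: a hard task graded with a brief prompt and 2 examples in the system prompt.
grading = TaskGrading(complexity="hard", prompt_detail="brief", n_shots=2)
```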
- [x] define complexities for tasks
- [x] define N_shot for tasks
- [x] define prompt_detail for tasks
- [x] adjust the benchmark configs and examples
- [x] adjust saving results - for different levels of prompt detail, save the base prompt
- [x] adjust the visualization - plot labels are wrong; the same tasks with different prompts are counted as separate in the task detailed analysis (see the grouping sketch after this list)
- [x] adjust docs
- [x] merge mocks from different types of tasks
- [x] make more tools available across tasks
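For the result-saving and visualization items above, a rough sketch of grouping runs of the same base task across prompt-detail levels, so they are not counted as separate tasks in the detailed analysis. The result format is hypothetical and only assumes each saved run keeps the base task name:

```python
from collections import defaultdict

# Hypothetical saved results: one entry per run, with the base task name kept
# regardless of which prompt-detail variant was used.
results = [
    {"task": "move_forward_2m", "prompt_detail": "brief", "passed": True},
    {"task": "move_forward_2m", "prompt_detail": "descriptive", "passed": True},
    {"task": "publish_string", "prompt_detail": "brief", "passed": False},
]

# Group by base task name so the same task with different prompts
# shares one row in the detailed analysis instead of being counted separately.
by_task = defaultdict(list)
for result in results:
    by_task[result["task"]].append(result)

for task_name, runs in by_task.items():
    passed = sum(run["passed"] for run in runs)
    print(f"{task_name}: {passed}/{len(runs)} prompt variants passed")
```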
define new sets of tasks for:
- [x] basic - may require defining more mock topics; also include calling services here, as now it's only topics
- [x] custom interfaces
- [ ] navigation
- [x] manipulation
- [x] spatial reasoning - may require new images?
New type of task -> analysis? It would group tasks that require deducing something or answering a question based on data gathered from topics, so something like IsSystemHealthyTask, which would require checking some topics and deducing whether they are valid based on the messages received.
If anything, it would go into future PRs, as it seems like a lot of work with research and validation to make these tasks resemble real-world ones.
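A rough, purely illustrative sketch of what such an analysis task could look like; the topic names, fields and validation hook are assumptions, not existing benchmark code:

```python
from dataclasses import dataclass, field


@dataclass
class IsSystemHealthyTask:
    """Hypothetical analysis task: read diagnostic-style topics and deduce system health."""

    prompt: str = (
        "Check the /battery_state and /diagnostics topics and answer whether "
        "the system is healthy. Answer with 'healthy' or 'unhealthy'."
    )
    # Topics the agent is expected to inspect (these would be mocked in the benchmark).
    expected_topics: list[str] = field(default_factory=lambda: ["/battery_state", "/diagnostics"])
    expected_answer: str = "healthy"

    def validate(self, agent_answer: str, topics_read: list[str]) -> bool:
        # Pass only if the agent actually gathered data from the expected topics
        # and its final answer matches the expected conclusion.
        read_all = all(topic in topics_read for topic in self.expected_topics)
        return read_all and agent_answer.strip().lower() == self.expected_answer
```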
Add a new type of validator -> optional, which will pass when any of the given subtasks passed. Useful when there are a couple of ways of doing the same thing.
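A minimal sketch of the optional validator idea, assuming each subtask exposes a boolean check (the names are hypothetical):

```python
from typing import Callable

# Hypothetical: each subtask exposes a check that reports whether it passed.
Subtask = Callable[[], bool]


def optional_validator(subtasks: list[Subtask]) -> bool:
    """Pass when ANY of the given subtasks passed - useful when there are
    several valid ways of accomplishing the same thing."""
    return any(subtask() for subtask in subtasks)


# Example: the agent may either publish velocity commands directly or call a navigation
# action; the task passes if either path was completed correctly.
passed = optional_validator([
    lambda: True,   # e.g. check_cmd_vel_messages()
    lambda: False,  # e.g. check_navigate_action_called()
])
```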
Added many more tasks in a series of PRs: https://github.com/RobotecAI/rai/pull/620, https://github.com/RobotecAI/rai/pull/656, github.com/RobotecAI/rai/pull/644, https://github.com/RobotecAI/rai/pull/638, https://github.com/RobotecAI/rai/pull/637, https://github.com/RobotecAI/rai/pull/636
Spatial reasoning was then removed as the VLM benchmark was introduced, as was navigation, which didn't suit this benchmark.
Finally, all of them were merged to main in this commit -> https://github.com/RobotecAI/rai/commit/baace12eb70bb761a3fdf1aa81dcf3e43aaa9d59