waymo-open-dataset
Getting different numbers when running the motion metrics functions locally than when submitted to Waymo Servers
Note: this is for the 2022 challenge. We are working on the 2022 dataset, and as a sanity check we submitted the ground truth values given to us in the validation set (non-interactive part) as our predictions. However, when I use the function given to us, I get the following numbers:
On the validation dashboard, though, I get these numbers:
Can I get some insight as to what is going on? Is there any difference between how metrics are computed in the cloud vs. in the function provided? Also, I expected these metrics to be perfect since the predictions are the ground truth values, but it seems they are not. I believe it's because not all ground truth values are observed, so in those cases there is a -1 for some timesteps, which would create large errors depending on the coordinate system. But why are we even scoring timesteps for which the ground truth isn't observed?
Hi thanks for raising this. Could you tell me what email address you used to submit this? We will take a look at the submission.
It was the submission made on 9/28/23 at 12:25 PM.
Looking at your submission, it appears there are predictions with all -1s in the trajectory for agents that have valid data in the ground truth. For example, in the scenario with ID=6bdceecbf2416202, the agent with ID 3579 has valid data for all time steps in the ground truth, but your submission data is all -1 for the predictions. The agent was verified to have valid data in both the tf.Example and the Scenario proto data.
Could you verify on your end to see if the ground truth data for this example matches the data in your submission?
Do you know which file out of validation-XXXXX-of_00150 that scenario exists in? It would help me find that particular scenario in the data.
Nevermind, I found it. You're right that there was a small bug in my submission. Let me resubmit and retry.
Good to hear you found it. Note that you will likely get a non-zero overlap as the bounding box estimates have some noise and there is some overlap in the ground truth for objects that are very close together (often groups of pedestrians).
@scott-ettinger, we tried to make another submission just to ensure local and cloud match for sanity-check purposes, but we're getting some server failures:
http://type.googleapis.com/util.ErrorSpacePayload='RPC::UNREACHABLE'
Do you think you can check what the issue is?
I'll take a look. Sorry for the trouble.
This was a temporary issue with the server. Could you try your submission again?
@scott-ettinger thanks for the quick response in resolving the issue!
So now we've successfully submitted the ground truth predictions to the Waymo servers and the results look as expected:
However, these are the results I see when I run locally:
It's good to see that minADE, minFDE, and missRate match what we expect. Overlap Rate and mAP seem to be a bit off, though. You already mentioned that Overlap Rate is not expected to be zero, since pedestrians can be close enough that their boxes appear to overlap. I'm not sure what's going on with mAP, though. Looking closer at the results, there are some tf examples where mAP is 0, which is why I don't get an average of 1.0 in my local results.
Could you tell me how you are running the metrics locally?
Yeah, so I import py_metrics_ops from the WOMD GitHub repo and call the motion_metrics function. I do this for each tf_example separately (i.e., batch size is 1). I then manually compute the average (ignoring -1's in the output, which indicate that there was no veh/ped/cyc in that scene so that metric was not computed). I don't think the averaging is the issue, because I see instances where a single tf example doesn't have perfect mAP.
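For completeness, the averaging over tf examples looks roughly like the sketch below (simplified; average_ignoring_missing is just a name I made up for illustration, not something from the WOMD code):

import numpy as np

def average_ignoring_missing(per_example_values, missing_value=-1.0):
  """Averages per-example metric values, skipping the -1 entries that
  indicate no veh/ped/cyc of that breakdown was present in the scene."""
  values = np.asarray(per_example_values, dtype=np.float64)
  kept = values[values != missing_value]
  return float(kept.mean()) if kept.size else missing_value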
Below is the code I call:
from waymo_open_dataset.metrics.ops import py_metrics_ops

(min_ade, min_fde, miss_rate, overlap_rate,
 mean_average_precision) = py_metrics_ops.motion_metrics(
     prediction_trajectory=prediction_trajectory,
     prediction_score=prediction_score,
     ground_truth_trajectory=ground_truth_trajectory,
     ground_truth_is_valid=ground_truth_is_valid,
     object_type=object_type,
     object_id=object_id,
     prediction_ground_truth_indices=prediction_ground_truth_indices,
     prediction_ground_truth_indices_mask=prediction_ground_truth_indices_mask,
     config=config.SerializeToString(),
     scenario_id=scenario_id,
 )
Perhaps the issue is that one of the arguments I'm passing to this function is wrong. Let me walk through how I construct each of the inputs and please let me know if something is wrong.
The prediction_trajectory and prediction_score are exactly what we submit to the server. The order of the agents is exactly the same as the order provided in the tf_example.
I construct ground_truth_trajectory using the following code:
def extract_gt_trajectory(tf_example_original):
  """Extracts the ground truth trajectory for all tracks.

  The returned trajectory is in the frame that the original data is
  provided in.

  Args:
    tf_example_original: str -> tensor dict containing the original features
      from the Waymo Open Motion Dataset.

  Returns:
    gt_traj: shape [1, waymo_constants.NUM_OBJECTS, TG, 7].
      The 7 state variables are:
      [x, y, length, width, heading, velocity_x, velocity_y]
      |TG| is the number of steps in the ground truth track. It includes the
      past, current, and future states.
      |length| is the length of the bounding box around the track in meters.
      This refers to the axis in the direction of the track.
      |width| is the width of the bounding box around the track in meters.
      This refers to the axis that is parallel to the axles of the track.
      |heading| is the angle of the track in radians.
  """
  # Shape = [waymo_constants.NUM_OBJECTS, TG]
  track_x_smooth = tf.concat([
      tf_example_original["state/past/x"],
      tf_example_original["state/current/x"],
      tf_example_original["state/future/x"],
  ], axis=1)
  track_y_smooth = tf.concat([
      tf_example_original["state/past/y"],
      tf_example_original["state/current/y"],
      tf_example_original["state/future/y"],
  ], axis=1)
  track_length = tf.concat([
      tf_example_original["state/past/length"],
      tf_example_original["state/current/length"],
      tf_example_original["state/future/length"],
  ], axis=1)
  track_width = tf.concat([
      tf_example_original["state/past/width"],
      tf_example_original["state/current/width"],
      tf_example_original["state/future/width"],
  ], axis=1)
  track_heading = tf.concat([
      tf_example_original["state/past/bbox_yaw"],
      tf_example_original["state/current/bbox_yaw"],
      tf_example_original["state/future/bbox_yaw"],
  ], axis=1)
  track_vx_smooth = tf.concat([
      tf_example_original["state/past/velocity_x"],
      tf_example_original["state/current/velocity_x"],
      tf_example_original["state/future/velocity_x"],
  ], axis=1)
  track_vy_smooth = tf.concat([
      tf_example_original["state/past/velocity_y"],
      tf_example_original["state/current/velocity_y"],
      tf_example_original["state/future/velocity_y"],
  ], axis=1)

  # Shape = [waymo_constants.NUM_OBJECTS, TG, 1]
  track_x_smooth = tf.expand_dims(track_x_smooth, axis=-1)
  track_y_smooth = tf.expand_dims(track_y_smooth, axis=-1)
  track_length = tf.expand_dims(track_length, axis=-1)
  track_width = tf.expand_dims(track_width, axis=-1)
  track_heading = tf.expand_dims(track_heading, axis=-1)
  track_vx_smooth = tf.expand_dims(track_vx_smooth, axis=-1)
  track_vy_smooth = tf.expand_dims(track_vy_smooth, axis=-1)

  # Shape = [waymo_constants.NUM_OBJECTS, TG, 7]
  gt_traj = tf.concat([
      track_x_smooth, track_y_smooth, track_length, track_width,
      track_heading, track_vx_smooth, track_vy_smooth
  ], axis=2)

  # Shape = [1, waymo_constants.NUM_OBJECTS, TG, 7]
  gt_traj = tf.expand_dims(gt_traj, axis=0)
  return gt_traj
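I then call it as:

ground_truth_trajectory = extract_gt_trajectory(tf_example_original)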
The ground_truth_is_valid is generated in a similar way:
# Shape = [waymo_constants.NUM_OBJECTS, TG]
gt_is_valid = tf.concat([
    tf_example_original["state/past/valid"],
    tf_example_original["state/current/valid"],
    tf_example_original["state/future/valid"],
], axis=1)
# Shape = [1, waymo_constants.NUM_OBJECTS, TG]
gt_is_valid = tf.expand_dims(gt_is_valid, axis=0)
The object_type and object_id are pretty straightforward:
# Shape = [1, waymo_constants.NUM_OBJECTS]
obj_ids = tf.expand_dims(tf_example_original["state/id"], axis=0)
# Shape = [1, waymo_constants.NUM_OBJECTS]
waymo_obj_types = tf.expand_dims(tf_example_original["state/type"], axis=0)
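For reference, my understanding of the state/type encoding (it should match the Track.ObjectType enum in the Scenario proto) is:

# state/type values as I understand them (scenario_pb2.Track.ObjectType):
#   0 = unset, 1 = vehicle, 2 = pedestrian, 3 = cyclist, 4 = other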
The prediction_ground_truth_indices is just an identity mapping, where waymo_constants.NUM_OBJECTS = 128:

pred_gt_indices = tf.reshape(
    tf.range(waymo_constants.NUM_OBJECTS, dtype=tf.int64),
    [1, waymo_constants.NUM_OBJECTS, 1])
prediction_ground_truth_indices_mask is just a bunch of ones with the same shape as prediction_ground_truth_indices. The reason this works is that we later filter out all predictions that we are not required to predict; more on this just below.
pred_gt_indices_mask = tf.ones_like(pred_gt_indices, dtype=bool)
We then filter out the agents that are not in tracks_to_predict:
tracks_to_predict = tf.boolean_mask(
    tf.range(waymo_constants.NUM_OBJECTS, dtype=tf.int64),
    tf_example_original["state/tracks_to_predict"] == 1)
# Ranks stay the same; only the agent dimension shrinks to the set of
# required agents.
pred_traj = tf.gather(pred_traj, tracks_to_predict, axis=1)
pred_gt_indices = tf.gather(pred_gt_indices, tracks_to_predict, axis=1)
pred_gt_indices_mask = tf.gather(
    pred_gt_indices_mask, tracks_to_predict, axis=1)
pred_score = tf.gather(pred_score, tracks_to_predict, axis=1)
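As a quick sanity check on the filtering (illustrative only), I verify that the agent dimension of each prediction tensor now matches the number of required agents:

# Axis 1 of every prediction tensor should equal the number of tracks we
# are required to predict after the gathers above.
num_to_predict = tf.shape(tracks_to_predict)[0]
tf.debugging.assert_equal(tf.shape(pred_traj)[1], num_to_predict)
tf.debugging.assert_equal(tf.shape(pred_score)[1], num_to_predict)
tf.debugging.assert_equal(tf.shape(pred_gt_indices)[1], num_to_predict)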
The config is taken straight from the WOMD codebase:
from google.protobuf import text_format
from waymo_open_dataset.protos import motion_metrics_pb2


def get_challenge_config():
  """Returns the config used by the official Waymo challenge."""
  config = motion_metrics_pb2.MotionMetricsConfig()
  config_text = """
    track_steps_per_second: 10
    prediction_steps_per_second: 2
    track_history_samples: 10
    track_future_samples: 80
    speed_lower_bound: 1.4
    speed_upper_bound: 11.0
    speed_scale_lower: 0.5
    speed_scale_upper: 1.0
    step_configurations {
      measurement_step: 5
      lateral_miss_threshold: 1.0
      longitudinal_miss_threshold: 2.0
    }
    step_configurations {
      measurement_step: 9
      lateral_miss_threshold: 1.8
      longitudinal_miss_threshold: 3.6
    }
    step_configurations {
      measurement_step: 15
      lateral_miss_threshold: 3.0
      longitudinal_miss_threshold: 6.0
    }
    max_predictions: 6
  """
  text_format.Parse(config_text, config)
  return config
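It is then used as:

config = get_challenge_config()
# config.SerializeToString() is what gets passed as the config argument of
# motion_metrics in the call shown earlier.
serialized_config = config.SerializeToString()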
The scenario_id is just tf_example_original['scenario/id'].
One final comment. Based on the WOMD documentation:
- B: batch size. Each batch should contain 1 scenario.
- M: Number of joint prediction groups to predict per scenario.
- N: number of agents in a joint prediction. 1 if mutual independence is
assumed between agents.
- K: top_K predictions per joint prediction.
- A: number of agents in the groundtruth.
- TP: number of steps to evaluate on. Matches len(config.step_measurement).
- TG: number of steps in the groundtruth track. Matches
config.track_history_samples + 1 + config.future_history_samples.
- BR: number of breakdowns.
In my setup:
B = 1 since I only pass in 1 scenario/tf example at a time
M = number of agents to predict (at most 8), since each prediction group consists of just 1 agent, so the number of groups equals the number of agents to predict.
N = 1 because we are not doing joint prediction
K = 6 because we output 6 predictions (for the ground truth, I pass in 6 identical trajectories and put all the probability mass on the 1st. This seemed to work fine as the Waymo servers gave me the expected results)
A = 128 (I'm assuming non-valid agents don't matter since we provide the mapping from the prediction agent to the ground truth agent)
TP = 16 since we predict 16 points (8 seconds of future at 0.5-second intervals).
TG = 10 + 1 + 80 = 91
Let me know if the above is correct.
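One small sanity check I run on TG (illustrative; it just sums the lengths of the past/current/future features in the tf example):

# TG should equal track_history_samples + 1 + track_future_samples = 91.
tg = (tf_example_original["state/past/x"].shape[1]
      + tf_example_original["state/current/x"].shape[1]
      + tf_example_original["state/future/x"].shape[1])
assert tg == 91, tg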
I apologize for the long post, but I hope this explains my setup. Please let me know if anything is not clear. And thank you in advance for the support!
Apologies for the late reply. It is difficult to determine what might be going wrong here. We will take a look at the tf wrapper code to see if we find any issues.
Sure, let me know if I can provide any other information to assist in the process!
@scott-ettinger any updates on this?
Sorry for the delay. We weren't able to find the issue on our side. Are you able to use the validation server as a workaround until this is resolved?
Yeah, the validation server works fine for now, but what is the plan to resolve it? Is this still something you all are looking into or are you closing the case?
Thanks for your patience. This requires a bit of debugging. I will see if we can look into this further.
Just to add some more details to the issue: we noticed that computing mAP with the ground truth used as the predictions fails (it outputs 0.0 instead of 1.0) when there are invalid states in the ground truth track.
We can actually reproduce this locally if you look at the 6th scenario in the file waymo_open_dataset_motion_v_1_1_0/scenario/validation/validation.tfrecord-00149-of-00150. Note this is the 2022 dataset.
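For reference, this is roughly how we located that scenario and confirmed the invalid states (a sketch; the field names come from the Scenario proto):

import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2

filename = ("waymo_open_dataset_motion_v_1_1_0/scenario/validation/"
            "validation.tfrecord-00149-of-00150")
for i, record in enumerate(tf.data.TFRecordDataset(filename)):
  if i != 5:  # The 6th scenario in the shard (zero-indexed).
    continue
  scenario = scenario_pb2.Scenario()
  scenario.ParseFromString(record.numpy())
  print("scenario_id:", scenario.scenario_id)
  # List the invalid time steps for each track we are required to predict.
  for required in scenario.tracks_to_predict:
    track = scenario.tracks[required.track_index]
    invalid_steps = [t for t, s in enumerate(track.states) if not s.valid]
    print("track", track.id, "invalid steps:", invalid_steps)
  break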
Specifically, when there is an invalid state, this function returns nullopt: https://github.com/waymo-research/waymo-open-dataset/blob/master/src/waymo_open_dataset/metrics/motion_metrics.cc#L269. That function is called during the calculation of mAP. The state is determined here: https://github.com/waymo-research/waymo-open-dataset/blob/master/src/waymo_open_dataset/metrics/ops/utils.cc#L354
Apologies, we have not resolved this. The TF ops wrapper code is old and a bit difficult to maintain. I would like to get your feedback: for our next release we could provide a way to compute metrics locally for the validation set with a function that takes a submission file as input. Would this work for your workflow, or would you require a TensorFlow-based solution?
I think a local script that takes the submission file as input should be sufficient.
Regarding "require a TensorFlow-based solution": a TensorFlow-based solution would be better for my project, since I would not want to create a Submission object/file just to evaluate the output of the model directly.
Hi, I am also experiencing the same issue. Can we conclude that this issue comes from the TF ops wrapper for those metrics? If so, should we only use minADE/FDE/MR from the local evaluation and stick to the results reported by the Waymo eval server if we want to know the mAP?