waymo-open-dataset
Getting different numbers when running the motion metrics functions locally than when submitted to Waymo Servers
Note: this is for the 2022 challenge. We are working on the 2022 dataset, and as a sanity check we submitted the ground truth values given to us in the validation set (non-interactive part) as our predictions. However, when I use the function given to us, I get the following numbers:
On the validation dashboard, though, I get these numbers:
Can I get some insight as to what is going on? Is there any difference between how metrics are computed in the cloud vs. in the function provided? Also, I expected these metrics to be perfect since the predictions are the ground truth values, but it seems they are not. I believe it's because not all ground truth values are observed, so in those cases there is a -1 for some timesteps, which would create large errors depending on the coordinate system. But why are we even scoring timesteps for which the ground truth isn't observed?
Hi thanks for raising this. Could you tell me what email address you used to submit this? We will take a look at the submission.
It was the submission made on 9/28/23 at 12:25 PM.
Looking at your submission, it appears there are predictions with all -1s in the trajectory for agents that have valid data in the ground truth. For example, in the scenario with ID=6bdceecbf2416202, the agent with ID 3579 has valid data for all time steps in the ground truth, but your submission data is all -1 for the predictions. The agent was verified to have valid data in both the tf.Example and the Scenario proto data.
Could you verify on your end to see if the ground truth data for this example matches the data in your submission?
Do you know which file out of validation-XXXXX-of_00150 that scenario exists in? It would help me find that particular scenario in the data.
Nevermind, I found it. You're right that there was a small bug in my submission. Let me resubmit and retry.
Good to hear you found it. Note that you will likely get a non-zero overlap as the bounding box estimates have some noise and there is some overlap in the ground truth for objects that are very close together (often groups of pedestrians).
@scott-ettinger, we tried to make another submission just to ensure local and cloud match for sanity-check purposes, but we're getting some server failures:
http://type.googleapis.com/util.ErrorSpacePayload='RPC::UNREACHABLE'
Do you think you can check what the issue is?
I'll take a look. Sorry for the trouble.
This was a temporary issue with the server. Could you try your submission again?
@scott-ettinger thanks for the quick response in resolving the issue!
So now we've successfully submitted the ground truth predictions to the Waymo servers and the results look as expected:
However, these are the results I see when I run locally:
It's good to see that minADE, minFDE, and missRate match what we expect. Overlap Rate and mAP seem to be a bit off, though. You already mentioned that Overlap Rate is not expected to be zero, since pedestrians can be close enough that their boxes appear to overlap. I'm not sure what's going on with mAP, though. Looking closer at the results, there are some tf examples where mAP is 0, which is why I don't get an average of 1.0 in my local results.
Could you tell me how you are running the metrics locally?
Yeah, so I import py_metrics_ops from the WOMD GitHub repo and call the motion_metrics function. I do this for each tf_example separately (i.e., batch size is 1). I then manually compute the average (ignoring -1's in the output, which indicate that there was no veh/ped/cyc in that scene so that metric was not computed). I don't think the averaging is the issue, because I see instances where a single tf example doesn't have perfect mAP.
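For completeness, the averaging over tf examples looks roughly like the sketch below (simplified; average_ignoring_missing is just a name I made up for illustration, not something from the WOMD code):

import numpy as np

def average_ignoring_missing(per_example_values, missing_value=-1.0):
  """Averages per-example metric values, skipping the -1 entries that
  indicate no veh/ped/cyc of that breakdown was present in the scene."""
  values = np.asarray(per_example_values, dtype=np.float64)
  kept = values[values != missing_value]
  return float(kept.mean()) if kept.size else missing_value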
Below is the code I call:
from waymo_open_dataset.metrics.ops import py_metrics_ops

(min_ade, min_fde, miss_rate, overlap_rate,
 mean_average_precision) = py_metrics_ops.motion_metrics(
     prediction_trajectory=prediction_trajectory,
     prediction_score=prediction_score,
     ground_truth_trajectory=ground_truth_trajectory,
     ground_truth_is_valid=ground_truth_is_valid,
     object_type=object_type,
     object_id=object_id,
     prediction_ground_truth_indices=prediction_ground_truth_indices,
     prediction_ground_truth_indices_mask=prediction_ground_truth_indices_mask,
     config=config.SerializeToString(),
     scenario_id=scenario_id,
 )
Perhaps the issue is that one of the arguments I'm passing to this function is wrong. Let me walk through how I construct each of the inputs and please let me know if something is wrong.
The prediction_trajectory and prediction_score are exactly what we submit to the server. The order of the agents is exactly the same as the order provided in the tf_example.
I construct ground_truth_trajectory using the following code:
def extract_gt_trajectory(tf_example_original):
  """Extracts the ground truth trajectory for all tracks.

  The returned trajectory is in the frame that the original data is
  provided in.

  Args:
    tf_example_original: str -> tensor dict containing the original features
      from the Waymo Open Motion Dataset.

  Returns:
    gt_traj: shape [1, waymo_constants.NUM_OBJECTS, TG, 7].
      The 7 state variables are:
      [x, y, length, width, heading, velocity_x, velocity_y]
      |TG| is the number of steps in the ground truth track. It includes the
      past, current, and future states.
      |length| is the length of the bounding box around the track in meters.
      This refers to the axis in the direction of the track.
      |width| is the width of the bounding box around the track in meters.
      This refers to the axis that is parallel to the axles of the track.
      |heading| is the angle of the track in radians.
  """
  # Shape = [waymo_constants.NUM_OBJECTS, TG]
  track_x_smooth = tf.concat([
      tf_example_original["state/past/x"],
      tf_example_original["state/current/x"],
      tf_example_original["state/future/x"],
  ], axis=1)
  track_y_smooth = tf.concat([
      tf_example_original["state/past/y"],
      tf_example_original["state/current/y"],
      tf_example_original["state/future/y"],
  ], axis=1)
  track_length = tf.concat([
      tf_example_original["state/past/length"],
      tf_example_original["state/current/length"],
      tf_example_original["state/future/length"],
  ], axis=1)
  track_width = tf.concat([
      tf_example_original["state/past/width"],
      tf_example_original["state/current/width"],
      tf_example_original["state/future/width"],
  ], axis=1)
  track_heading = tf.concat([
      tf_example_original["state/past/bbox_yaw"],
      tf_example_original["state/current/bbox_yaw"],
      tf_example_original["state/future/bbox_yaw"],
  ], axis=1)
  track_vx_smooth = tf.concat([
      tf_example_original["state/past/velocity_x"],
      tf_example_original["state/current/velocity_x"],
      tf_example_original["state/future/velocity_x"],
  ], axis=1)
  track_vy_smooth = tf.concat([
      tf_example_original["state/past/velocity_y"],
      tf_example_original["state/current/velocity_y"],
      tf_example_original["state/future/velocity_y"],
  ], axis=1)

  # Shape = [waymo_constants.NUM_OBJECTS, TG, 1]
  track_x_smooth = tf.expand_dims(track_x_smooth, axis=-1)
  track_y_smooth = tf.expand_dims(track_y_smooth, axis=-1)
  track_length = tf.expand_dims(track_length, axis=-1)
  track_width = tf.expand_dims(track_width, axis=-1)
  track_heading = tf.expand_dims(track_heading, axis=-1)
  track_vx_smooth = tf.expand_dims(track_vx_smooth, axis=-1)
  track_vy_smooth = tf.expand_dims(track_vy_smooth, axis=-1)

  # Shape = [waymo_constants.NUM_OBJECTS, TG, 7]
  gt_traj = tf.concat([
      track_x_smooth, track_y_smooth, track_length, track_width,
      track_heading, track_vx_smooth, track_vy_smooth
  ], axis=2)

  # Shape = [1, waymo_constants.NUM_OBJECTS, TG, 7]
  gt_traj = tf.expand_dims(gt_traj, axis=0)
  return gt_traj
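I then call it as:

ground_truth_trajectory = extract_gt_trajectory(tf_example_original)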
The ground_truth_is_valid is generated in a similar way:
# Shape = [waymo_constants.NUM_OBJECTS, TG]
gt_is_valid = tf.concat([
    tf_example_original["state/past/valid"],
    tf_example_original["state/current/valid"],
    tf_example_original["state/future/valid"],
], axis=1)
# Shape = [1, waymo_constants.NUM_OBJECTS, TG]
gt_is_valid = tf.expand_dims(gt_is_valid, axis=0)
The object_type and object_id are pretty straightforward:
# Shape = [1, waymo_constants.NUM_OBJECTS]
obj_ids = tf.expand_dims(tf_example_original["state/id"], axis=0)
# Shape = [1, waymo_constants.NUM_OBJECTS]
waymo_obj_types = tf.expand_dims(tf_example_original["state/type"], axis=0)
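For reference, my understanding of the state/type encoding (it should match the Track.ObjectType enum in the Scenario proto) is:

# state/type values as I understand them (scenario_pb2.Track.ObjectType):
#   0 = unset, 1 = vehicle, 2 = pedestrian, 3 = cyclist, 4 = other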
The prediction_ground_truth_indices is just an identity mapping, where waymo_constants.NUM_OBJECTS = 128:

pred_gt_indices = tf.reshape(
    tf.range(waymo_constants.NUM_OBJECTS, dtype=tf.int64),
    [1, waymo_constants.NUM_OBJECTS, 1])
prediction_ground_truth_indices_mask is just a bunch of ones with the same shape as prediction_ground_truth_indices. The reason this works is that we later filter out all predictions that we are not required to predict; more on this just below.
pred_gt_indices_mask = tf.ones_like(pred_gt_indices, dtype=bool)
We then filter out the agents that are not in tracks_to_predict:
tracks_to_predict = tf.boolean_mask(
    tf.range(waymo_constants.NUM_OBJECTS, dtype=tf.int64),
    tf_example_original["state/tracks_to_predict"] == 1)
# Ranks stay the same; only the agent dimension shrinks to the set of
# required agents.
pred_traj = tf.gather(pred_traj, tracks_to_predict, axis=1)
pred_gt_indices = tf.gather(pred_gt_indices, tracks_to_predict, axis=1)
pred_gt_indices_mask = tf.gather(
    pred_gt_indices_mask, tracks_to_predict, axis=1)
pred_score = tf.gather(pred_score, tracks_to_predict, axis=1)
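As a quick sanity check on the filtering (illustrative only), I verify that the agent dimension of each prediction tensor now matches the number of required agents:

# Axis 1 of every prediction tensor should equal the number of tracks we
# are required to predict after the gathers above.
num_to_predict = tf.shape(tracks_to_predict)[0]
tf.debugging.assert_equal(tf.shape(pred_traj)[1], num_to_predict)
tf.debugging.assert_equal(tf.shape(pred_score)[1], num_to_predict)
tf.debugging.assert_equal(tf.shape(pred_gt_indices)[1], num_to_predict)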
The config is taken straight from the WOMD codebase:
from google.protobuf import text_format
from waymo_open_dataset.protos import motion_metrics_pb2


def get_challenge_config():
  """Returns the config used by the official Waymo challenge."""
  config = motion_metrics_pb2.MotionMetricsConfig()
  config_text = """
    track_steps_per_second: 10
    prediction_steps_per_second: 2
    track_history_samples: 10
    track_future_samples: 80
    speed_lower_bound: 1.4
    speed_upper_bound: 11.0
    speed_scale_lower: 0.5
    speed_scale_upper: 1.0
    step_configurations {
      measurement_step: 5
      lateral_miss_threshold: 1.0
      longitudinal_miss_threshold: 2.0
    }
    step_configurations {
      measurement_step: 9
      lateral_miss_threshold: 1.8
      longitudinal_miss_threshold: 3.6
    }
    step_configurations {
      measurement_step: 15
      lateral_miss_threshold: 3.0
      longitudinal_miss_threshold: 6.0
    }
    max_predictions: 6
  """
  text_format.Parse(config_text, config)
  return config
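It is then used as:

config = get_challenge_config()
# config.SerializeToString() is what gets passed as the config argument of
# motion_metrics in the call shown earlier.
serialized_config = config.SerializeToString()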
The scenario_id is just tf_example_original['scenario/id'].
One final comment. Based on the WOMD documentation:
- B: batch size. Each batch should contain 1 scenario.
- M: Number of joint prediction groups to predict per scenario.
- N: number of agents in a joint prediction. 1 if mutual independence is
assumed between agents.
- K: top_K predictions per joint prediction.
- A: number of agents in the groundtruth.
- TP: number of steps to evaluate on. Matches len(config.step_measurement).
- TG: number of steps in the groundtruth track. Matches
config.track_history_samples + 1 + config.future_history_samples.
- BR: number of breakdowns.
In my setup:
B = 1 since I only pass in 1 scenario/tf example at a time
M = number of agents to predict (at most 8), since each prediction group consists of just 1 agent, so the number of groups equals the number of agents to predict.
N = 1 because we are not doing joint prediction
K = 6 because we output 6 predictions (for the ground truth, I pass in 6 identical trajectories and put all the probability mass on the 1st. This seemed to work fine as the Waymo servers gave me the expected results)
A = 128 (I'm assuming non-valid agents don't matter since we provide the mapping from the prediction agent to the ground truth agent)
TP = 16 since we predict 16 points (8 seconds of future at 0.5-second intervals).
TG = 10 + 1 + 80 = 91
Let me know if the above is correct.
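One small sanity check I run on TG (illustrative; it just sums the lengths of the past/current/future features in the tf example):

# TG should equal track_history_samples + 1 + track_future_samples = 91.
tg = (tf_example_original["state/past/x"].shape[1]
      + tf_example_original["state/current/x"].shape[1]
      + tf_example_original["state/future/x"].shape[1])
assert tg == 91, tg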
I apologize for the long post, but I hope this explains my setup. Please let me know if anything is not clear. And thank you in advance for the support!
Apologies for the late reply. It is difficult to determine what might be going wrong here. We will take a look at the tf wrapper code to see if we find any issues.
Sure, let me know if I can provide any other information to assist in the process!
@scott-ettinger any updates on this?
Sorry for the delay. We weren't able to find the issue on our side. Are you able to use the validation server as a workaround until this is resolved?
Yeah, the validation server works fine for now, but what is the plan to resolve it? Is this still something you all are looking into or are you closing the case?
Thanks for your patience. This requires a bit of debugging. I will see if we can look into this further.
Just to add some more details to the issue: we noticed that computing mAP with the ground truth used as the predictions fails (it outputs 0.0 instead of 1.0) when there are invalid states in the ground truth track.
We can actually reproduce this locally if you look at the 6th scenario in the file waymo_open_dataset_motion_v_1_1_0/scenario/validation/validation.tfrecord-00149-of-00150. Note this is the 2022 dataset.
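For reference, this is roughly how we located that scenario and confirmed the invalid states (a sketch; the field names come from the Scenario proto):

import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2

filename = ("waymo_open_dataset_motion_v_1_1_0/scenario/validation/"
            "validation.tfrecord-00149-of-00150")
for i, record in enumerate(tf.data.TFRecordDataset(filename)):
  if i != 5:  # The 6th scenario in the shard (zero-indexed).
    continue
  scenario = scenario_pb2.Scenario()
  scenario.ParseFromString(record.numpy())
  print("scenario_id:", scenario.scenario_id)
  # List the invalid time steps for each track we are required to predict.
  for required in scenario.tracks_to_predict:
    track = scenario.tracks[required.track_index]
    invalid_steps = [t for t, s in enumerate(track.states) if not s.valid]
    print("track", track.id, "invalid steps:", invalid_steps)
  break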
Specifically, when there is an invalid state, this function returns nullopt: https://github.com/waymo-research/waymo-open-dataset/blob/master/src/waymo_open_dataset/metrics/motion_metrics.cc#L269. That function is called during the calculation of mAP. The state is determined here: https://github.com/waymo-research/waymo-open-dataset/blob/master/src/waymo_open_dataset/metrics/ops/utils.cc#L354
Apologies, we have not resolved this. The TF ops wrapper code is old and a bit difficult to maintain. I would like to get your feedback: for our next release we could provide a way to compute metrics locally for the validation set with a function that takes a submission file as input. Would this work for your workflow, or would you require a TensorFlow-based solution?
I think a local script that takes the submission file as input should be sufficient.
Regarding "require a TensorFlow-based solution": a TensorFlow-based solution would be better for my project, since I would not want to create a Submission object/file just to evaluate the output of the model directly.
Hi, I am also experiencing the same issue. Can we conclude that this issue comes from the TF ops wrapper for those metrics? If so, should we only use minADE/FDE/MR from the local evaluation and stick to the results reported by the Waymo eval server if we want to know the mAP?