AvgMinMax median approximation is inconsistent
**Describe the bug**
The median value in dataset metrics (train_data_utils.py) produces different results on each run, even with identical input data. This causes validation failures when comparing metrics files. The _validate_aggregate_metrics function detects differences in the median field and raises a ValueError about conflicting aggregate metrics.
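For illustration, here is a minimal sketch of the kind of check that trips (simplified and hypothetical; the actual logic is `_validate_aggregate_metrics` in `train_data_utils.py`, which compares full metrics files):

```python
def validate_aggregate_metrics(existing: dict, computed: dict) -> None:
    """Simplified model: compare stored metrics against freshly computed ones."""
    differences = []
    for field_name, old_value in existing.items():
        new_value = computed.get(field_name)
        # Exact equality on floats: any run-to-run drift in an approximate
        # statistic such as the t-digest median registers as a mismatch.
        if new_value != old_value:
            differences.append(
                f"Numeric mismatch at {field_name}: {old_value} != {new_value}"
            )
    if differences:
        raise ValueError(f"Differences found in aggregate metrics:\n{differences}")
```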
**Steps/Code to reproduce bug**
Run data preparation on any dataset, e.g.:
```bash
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/library_judge_math/configs/bytedtsinghua_dapo17k.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=data/bytedtsinghua_dapo17k \
    +mode=train_preparation +should_download=true
```
Because the failure depends on a nondeterministic median estimate, this intermittently raises a ValueError about conflicting aggregate metrics:
```
Differences found in aggregate metrics:
[
    'Numeric mismatch at {field_name}.Median: 80.33 != 80.44'
]
...
Found conflicting aggregate metrics that need to be corrected:
- resources_servers/math_with_judge/data/dapo17k_train_metrics_conflict.json
- resources_servers/math_with_judge/data/dapo17k_validation_metrics_conflict.json
This could be due to a change in how metrics are calculated, leading to outdated metrics. Try deleting the below file(s) and rerunning data preparation:
- resources_servers/math_with_judge/data/dapo17k_train_metrics.json
- resources_servers/math_with_judge/data/dapo17k_validation_metrics.json
```
**Expected behavior**
Metrics should be deterministic. Running data preparation multiple times on the same dataset should produce identical metrics, including the median. The validation check should pass when re-running with unchanged data.
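One way to make this deterministic (a sketch only, not the project's implementation; `exact_median` is a hypothetical helper) is to compute the median exactly rather than estimating it:

```python
import statistics

def exact_median(values: list[float]) -> float:
    # Hypothetical replacement for the t-digest estimate: sorting and
    # taking the middle element depends only on the data, so repeated
    # runs over identical input always produce the same value.
    return statistics.median(values)
```

An exact median costs a sort instead of a streaming update, which is usually acceptable at dataset-preparation scale.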
**Configs**
Any dataset configuration.
**Environment details**
N/A
**Additional context**
The AvgMinMax class uses TDigest for median estimation. A t-digest computes an approximate quantile sketch rather than an exact median, and its estimates are not guaranteed to be identical across runs: common implementations randomize centroid compression, so the reported median can drift slightly even for identical input.
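The drift is easy to observe directly. The sketch below assumes the `tdigest` package from PyPI; the project may use a different t-digest implementation, so treat the class and method names as assumptions:

```python
import random
from tdigest import TDigest

# Fixed input data: a private RNG instance keeps the data identical
# without pinning the global random state the digest may draw from.
rng = random.Random(0)
values = [rng.uniform(0, 100) for _ in range(10_000)]

# Build two digests from the exact same values in the same order.
estimates = []
for _ in range(2):
    digest = TDigest()
    digest.batch_update(values)
    estimates.append(digest.percentile(50))

# The estimates can differ slightly because centroid compression is
# randomized internally, even though the input never changed.
print(estimates)
```

If the two printed values differ, exact-equality validation of stored metrics will fail on re-runs, exactly as described above.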