baseline training problem
Hi, it's me again🙂 I have another scenario I'd like to get your feedback on; here are the details:
I re-cached the datasets and trained a baseline model without style using your script run_diffusiondrive_training.sh. After training completed, I ran the evaluation scripts and was surprised to find that the results are almost identical to those from the diffusiondrive-style model (see the sanity-check sketch after the evaluation output below).
Finished running evaluation.
Number of successful scenarios: 4045.
Number of failed scenarios: 0.
Final average score of valid results: 0.8475605520408238.
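To help rule out the evaluation simply picking up the same weights twice, here is the quick sanity check I plan to run. The checkpoint paths and the `state_dict` wrapping are assumptions about my local setup, not anything from your codebase:

```python
# Sanity check (paths are placeholders for my local experiment folders):
# if every shared tensor is bit-identical, the evaluation most likely loaded
# the same weights for both runs.
import torch

baseline_ckpt = torch.load("exp/baseline/last.ckpt", map_location="cpu")
style_ckpt = torch.load("exp/diffusiondrive_style/last.ckpt", map_location="cpu")

# Lightning-style checkpoints wrap the weights in "state_dict";
# fall back to the raw dict otherwise.
baseline_sd = baseline_ckpt.get("state_dict", baseline_ckpt)
style_sd = style_ckpt.get("state_dict", style_ckpt)

shared = [
    k for k in baseline_sd
    if k in style_sd
    and isinstance(baseline_sd[k], torch.Tensor)
    and isinstance(style_sd[k], torch.Tensor)
]
identical = sum(torch.equal(baseline_sd[k], style_sd[k]) for k in shared)
print(f"{identical}/{len(shared)} shared tensors are bit-identical")
```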
Additionally, the experiment_name you provide for evaluation in scripts/evaluation/run_diffusiondrive_style.sh is "eval_diff_style_agent_ablation". Could you clarify what "ablation" refers to here? Does it relate to the model structure, the training strategy, or something else, or is it simply a typo and the name should just be "eval_diff_style_agent"? This experiment name really confuses me😣. I would really appreciate an explanation, thank you so much!
Hi @yougrianes
For the first issue:
We have carefully reviewed the original settings in our local codebase and did not find any similar issue.
Please note that we directly evaluate the last checkpoint from each training process, and all of these last checkpoints have been uploaded to Hugging Face. Your re-implementation results look unusual — could you provide more details about your setup so we can help investigate?
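If it helps, a rough way to gather those details is to read them out of the checkpoint itself. The snippet below is only a sketch: the path is a placeholder and the keys assume a PyTorch Lightning-style checkpoint, so they may differ depending on how training was launched.

```python
# Rough sketch for collecting setup details from a saved checkpoint.
# The path is a placeholder; the keys assume a PyTorch Lightning-style
# checkpoint and may not all be present in your setup.
import torch

ckpt = torch.load("path/to/your/last.ckpt", map_location="cpu")
for key in ("epoch", "global_step", "pytorch-lightning_version", "hyper_parameters"):
    print(key, "->", ckpt.get(key, "<not stored>"))
```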
For the second issue: the term "ablation" was originally used for our own internal ablation study; in the current version of the code it does not refer to anything. Apologies for this typo; we have removed it.
Thanks for sharing this info. I'll go over my training process again and loop you in if I find anything new. Best regards😊
