
train loss about training_perception

[Open] Ahnhojin1223 opened this issue · 1 comment

Hi, thank you for sharing GoalFlow!

I'm training the perception model using run_goalflow_training_perception.sh,
but my training loss keeps oscillating and doesn't converge even after 40 epochs.
(Attached TensorBoard screenshot shows losses for agent box/class and BEV semantic.)

Environment:

  • 4× V100 (32GB, NVIDIA DGX)
  • Batch size: 15 (to fully use VRAM)
  • Epochs: 40
  • Other configs: default

Questions:

  1. Is this oscillating loss behavior expected, or am I missing any training setting (LR, warmup, grad clip, etc.)?
  2. Why does the provided script use batch_size=2?
    Does GoalFlow assume a specific global batch size or gradient accumulation setting?
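For context on question 2: a common way to emulate a larger global batch size without more VRAM is gradient accumulation. A minimal PyTorch-style sketch of the pattern (the `model`, `loss_fn`, and `loader` names here are generic placeholders, not GoalFlow code):

```python
import torch

def train_epoch(model, loss_fn, loader, optimizer, accum_steps=8):
    """Emulate global batch = micro-batch size * accum_steps by
    accumulating gradients over several forward/backward passes
    before each optimizer step."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        # Divide by accum_steps so the accumulated gradient matches
        # the average over one large batch.
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

If the repo's script was tuned for an effective batch of 2 per GPU, changing it to 15 may also call for rescaling the learning rate; whether GoalFlow relies on a specific effective batch size is exactly what the question above asks.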

Thanks for any advice!

Ahnhojin1223 · Oct 23 '25

  1. The open-source code is the complete version, and the parameters in the shell script are already configured as intended.
  2. When I train on a 4090 GPU with batch_size=2, it already uses up all the VRAM. Could you try reducing tf_d_model and see if that helps?
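Reducing tf_d_model helps because both parameter count and activation memory in a transformer scale roughly quadratically with the model width. A back-of-envelope sketch (pure Python, illustrative constants only, not GoalFlow's actual architecture):

```python
def transformer_params(d_model, n_layers, ffn_mult=4):
    """Rough parameter count per transformer stack:
    attention projections ~ 4*d^2, feed-forward ~ 2*ffn_mult*d^2 per layer."""
    per_layer = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_layers * per_layer

# Halving d_model cuts parameters (and, roughly, activation memory) ~4x:
print(transformer_params(512, 6) / transformer_params(256, 6))  # → 4.0
```

So dropping tf_d_model by even a modest factor frees a disproportionate amount of VRAM, at some cost in model capacity.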

ZebinX · Nov 26 '25