
training issue

Open popov1212 opened this issue 3 months ago • 4 comments

Hi, thanks for your excellent work. I tried to run the training script as follows with the updated code:

export OPENSCENE_DATA_ROOT=/data/navsim_workspace/dataset  # adjust to your actual data path
export NAVSIM_EXP_ROOT=/data/navsim_workspace/exp  # adjust to your actual experiment path
export NUPLAN_MAPS_ROOT=/data/navsim_workspace/dataset/maps
export PYTHONPATH=/data/workspace/GoalFlow:/data/navsim_workspace/navsim:$PYTHONPATH

FEATURE_CACHE='/data/navsim_workspace/exp/goalflow_trainvalcache'  # set your feature_cache path
V99_PRETRAINED_PATH='/data/workspace/GoalFlow/data/depth_pretrained_v99-3jlw0p36-20210423_010520-model_final-remapped.pth'
CHECKPOINT_PATH=/data/workspace/GoalFlow/data/goalflow_traj_epoch_54-step_18260.ckpt
VOC_PATH='/data/workspace/GoalFlow/data/cluster_points_8192_.npy'
ONLY_PERCEPTION=False
FREEZE_PERCEPTION=True  # you can set this to False and increase batch_size if GPU memory is sufficient

python /data/workspace/GoalFlow/navsim/planning/script/run_training.py \
    agent=goalflow_agent_traj \
    experiment_name=a_train_traj \
    scene_filter=navtrain \
    split=trainval \
    cache_path=$FEATURE_CACHE \
    trainer.params.max_epochs=100 \
    agent.config.training=True \
    agent.config.has_navi=True \
    agent.config.start=True \
    agent.config.freeze_perception=$FREEZE_PERCEPTION \
    agent.config.only_perception=$ONLY_PERCEPTION \
    agent.config.train_scale=0.1 \
    agent.config.tf_d_model=1024 \
    agent.config.trajectory_weight=50.0 \
    agent.config.agent_class_weight=0.2 \
    agent.config.agent_box_weight=0.05 \
    agent.config.bev_semantic_weight=0.2 \
    agent.config.agent_loss=True \
    dataloader.params.batch_size=2 \
    use_cache_without_dataset=True \
    agent.config.v99_pretrained_path=$V99_PRETRAINED_PATH \
    agent.config.voc_path=$VOC_PATH

but the loss looks weird, as follows:

[Image: loss curve]

The picture above shows epoch 0, but the loss curve still follows the same trend at epoch 20. I tried modifying the batch_size, which didn't help. I also tried removing/adding "agent.checkpoint_path=$CHECKPOINT_PATH", but that didn't work either.

By the way, I use 2x5090 GPUs for training. The files were downloaded from your released Google Drive, including "depth_pretrained_v99-3jlw0p36-20210423_010520-model_final-remapped.pth", "goalflow_traj_epoch_54-step_18260.ckpt", and "cluster_points_8192_.npy".

Could you help me figure out what is causing this issue? Thanks in advance for your kind reply!

popov1212 avatar Sep 09 '25 04:09 popov1212


Same problem here.

fengjiang5 avatar Sep 16 '25 08:09 fengjiang5

@popov1212 Sorry for the delayed reply. This can happen when training the whole model directly. Have you tried training the perception module first, and then training the trajectory decoder?
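In case it helps, here is a hedged sketch of that two-stage schedule, reusing only the Hydra overrides already shown in this thread; the paths, epoch counts, and other overrides are placeholders rather than the official recipe:

```shell
# Stage 1 (sketch): train the perception part alone.
python navsim/planning/script/run_training.py \
    agent=goalflow_agent_traj \
    agent.config.training=True \
    agent.config.only_perception=True \
    agent.config.freeze_perception=False

# Stage 2 (sketch): load the stage-1 checkpoint, freeze perception,
# and train only the trajectory decoder.
python navsim/planning/script/run_training.py \
    agent=goalflow_agent_traj \
    agent.checkpoint_path=$CHECKPOINT_PATH \
    agent.config.training=True \
    agent.config.only_perception=False \
    agent.config.freeze_perception=True
```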

ZebinX avatar Oct 07 '25 03:10 ZebinX

I used your pretrained perception model and directly trained the trajectory decoder.

popov1212 avatar Oct 11 '25 09:10 popov1212

@popov1212 Is the perception module frozen or not? If the perception module is frozen, such a situation could indeed occur.
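For reference, a minimal PyTorch sketch of what a `freeze_perception`-style flag is typically expected to do; `TinyModel` and its submodule names are hypothetical stand-ins, not GoalFlow's actual classes:

```python
import torch.nn as nn

# Hypothetical two-part model: a perception backbone and a trajectory decoder.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.perception = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 2)

    def forward(self, x):
        return self.decoder(self.perception(x))

model = TinyModel()

# Freeze the perception submodule: its parameters stop receiving gradients,
# so the optimizer only updates the trajectory decoder.
for p in model.perception.parameters():
    p.requires_grad = False

# Only decoder parameters remain trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # → ['decoder.weight', 'decoder.bias']
```

With the perception frozen, any mismatch between the pretrained perception checkpoint and the current config (e.g. `tf_d_model`) cannot be corrected during training, which is one way a flat loss curve can arise.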

ZebinX avatar Oct 14 '25 12:10 ZebinX