Raise nan loss during training

Open YupuLu opened this issue 9 months ago • 13 comments

Hi Jeremy,

Recently I tried to train the ikflow model for testing. But even with your provided code, training always fails with ValueError: loss is nan. Have you encountered this before? I am also wondering whether you could share the training commands for the models listed in model_descriptions.yaml.

Best regards

YupuLu avatar Mar 25 '25 11:03 YupuLu

Also, I am a bit confused about the setting of softflow_noise_scale. Should it be 0.01 or 0.001?

For the trained model the settings are:

Target model parameters: IkflowModelParameters
  coupling_layer:       glow
  nb_nodes:     12
  dim_latent_space:     7
  coeff_fn_config:      3
  coeff_fn_internal_size:       1024
  permute_random_enabled:       True
  sigmoid_on_output:    False
  lambd_predict:        1.0
  init_scale:   0.04473500291638653
  rnvp_clamp:   2.5
  y_noise_scale:        1e-07
  zeros_noise_scale:    0.001
  softflow_noise_scale:         0.01
  softflow_enabled:     True
  robot_name:   panda
  model_weights_url:    https://storage.googleapis.com/ikflow_models/panda/panda__lyric-puddle-191__global_step%3D5.25M.pkl

For the command you provided: scripts/train.py --robot_name=panda --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=7 --batch_size=512 --learning_rate=0.00005 --gradient_clip_val=1 --dataset_tags non-self-colliding, those settings are:

IkflowModelParameters
  coupling_layer:       glow
  nb_nodes:     12
  dim_latent_space:     7
  coeff_fn_config:      3
  coeff_fn_internal_size:       1024
  permute_random_enabled:       True
  sigmoid_on_output:    False
  lambd_predict:        1.0
  init_scale:   0.04473500291638653
  rnvp_clamp:   2.5
  y_noise_scale:        1e-07
  zeros_noise_scale:    0.001
  softflow_noise_scale:         0.001
  softflow_enabled:     True
  run_description:      None

YupuLu avatar Mar 25 '25 11:03 YupuLu

Hi @YupuLu,

0.001 is a good choice for softflow_noise_scale; I've also had success with 0.01. I didn't find a single best value.

What command are you using to start your training? To prevent the loss is nan issue, I recommend reducing the learning rate and increasing the batch size if possible. Generally, though, you want the learning rate to be as high as possible. Start at 0.00005, then try 0.0000375, then 0.000025, then 0.0000125, then 0.00001, etc. A larger batch size will enable a higher learning rate.
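The back-off described above can be sketched as a small helper. This is only an illustration: the flags are the ones quoted in this thread for the panda command, while `build_train_command`, `first_stable_learning_rate`, the `trains_without_nan` callback, and the `python` interpreter prefix are hypothetical names introduced here, not part of ikflow.

```python
from typing import Callable, List, Optional

# Candidate learning rates from the comment above, highest first.
LEARNING_RATES = [0.00005, 0.0000375, 0.000025, 0.0000125, 0.00001]


def build_train_command(learning_rate: float, batch_size: int = 512) -> List[str]:
    """Build an argument list for scripts/train.py from flags quoted in this thread."""
    return [
        "python",  # assumption: the script is launched with the active interpreter
        "scripts/train.py",
        "--robot_name=panda",
        "--nb_nodes=12",
        "--coeff_fn_internal_size=1024",
        "--coeff_fn_config=3",
        "--dim_latent_space=7",
        f"--batch_size={batch_size}",
        f"--learning_rate={learning_rate:.7f}",
        "--gradient_clip_val=1",
        "--dataset_tags",
        "non-self-colliding",
    ]


def first_stable_learning_rate(
    trains_without_nan: Callable[[float], bool],
) -> Optional[float]:
    """Return the highest candidate rate for which the (hypothetical) callback
    reports a run that did not hit a nan loss, or None if every rate failed."""
    for learning_rate in LEARNING_RATES:
        if trains_without_nan(learning_rate):
            return learning_rate
    return None


if __name__ == "__main__":
    for lr in LEARNING_RATES:
        print(" ".join(build_train_command(lr)))
```

In practice the callback would launch the printed command (for example with `subprocess.run`) and report whether the run survived; how long to train before calling a rate stable is left open here.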

jstmn avatar Mar 26 '25 18:03 jstmn

Thank you for the suggestions!

First I used the command in readme.md and it failed quickly. Then I remembered the command you mentioned for the Panda. In that case training seems to work properly with softflow_noise_scale set to either 0.001 or 0.01.

I am just curious about the training settings you used to obtain the other successful models, like FetchArm, Iiwa7 and Rizon4, so I can get a sense of how you adjusted your hyperparameters.

One more thing is the dataset size: will it help if I increase it from 25000000 to 100000000?

YupuLu avatar Mar 27 '25 04:03 YupuLu

Sure, here are the training script arguments for the three robots you mentioned. I should probably add them somewhere in the repo. Please note that these probably aren't the parameters used in the ikflow paper either.

fetch_arm__large__mh186_9.25m / FetchArm:
scripts/train.py --robot_name=fetch_arm --dim_latent_space=10 --learning_rate=0.0000375 --nb_nodes=16 --batch_size=512 --gradient_clip_val=1 --softflow_noise_scale=0.001

iiwa7_full_temp_nsc_tpm / Iiwa7:
scripts/train.py --robot_name=iiwa7 --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=10 --batch_size=256 --learning_rate=0.0000375

rizon4__snowy-brook-208__global_step=2.75M:
scripts/train.py --robot_name=rizon4 --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=7 --batch_size=512 --learning_rate=0.00005 --gradient_clip_val=1

In general, the larger the "sum of the joint limit ranges" is (see Limitations in the paper), the lower the maximum stable learning rate will be. Put differently, a robot with simpler kinematics will allow for a higher learning rate.
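For a concrete sense of the quantity mentioned above, here is a minimal sketch that computes the sum of joint limit ranges. The two sets of limits are made-up illustrative values in radians, not the limits of any robot in this thread, and `sum_of_joint_limit_ranges` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def sum_of_joint_limit_ranges(limits: Sequence[Tuple[float, float]]) -> float:
    """Sum of (upper - lower) over all joints, in the units of the limits."""
    return sum(upper - lower for lower, upper in limits)


# Hypothetical 7-joint arms: one with +/- 90 degree joints, one with +/- 180 degree joints.
narrow_arm = [(-math.pi / 2, math.pi / 2)] * 7
wide_arm = [(-math.pi, math.pi)] * 7

narrow_sum = sum_of_joint_limit_ranges(narrow_arm)
wide_sum = sum_of_joint_limit_ranges(wide_arm)

if __name__ == "__main__":
    print(f"narrow arm: {narrow_sum:.2f} rad, wide arm: {wide_sum:.2f} rad")
```

Following the rule of thumb above, the arm with the larger sum would be the one to start at a lower learning rate.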

Increasing the dataset size is unlikely to help.

jstmn avatar Apr 01 '25 03:04 jstmn

Hi Jeremy, @jstmn. Regarding the function IKSolver._generate_exact_ik_solutions(): as it is currently used in evaluate.py, we call it with solutions_per_pose for each target pose individually.

I am curious whether we can stack several target poses together and pass them to this function in a single call. It seems mathematically feasible to me (though somewhat complex to implement) and might save time.

YupuLu avatar Apr 24 '25 09:04 YupuLu

Hi! Sorry for the delay. I'm on mobile, so bear with me if there are syntax errors here.

def generate_exact_ik_solutions(
        self,
        target_poses: torch.Tensor, …)

You probably want something like this:

target_pose1 = torch.tensor([x, y, z, qw, qx, qy, qz]).view(1, 7)
target_pose2 = torch.tensor([x, y, z, qw, qx, qy, qz]).view(1, 7)
n_sols = 10

target_batch = torch.cat([target_pose1.expand(n_sols, 7), target_pose2.expand(n_sols, 7)], dim=0)

generate_exact_ik_solutions(target_batch, …)
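The same idea generalizes to any number of poses. The following is a minimal sketch, assuming poses are stored as rows of [x, y, z, qw, qx, qy, qz]; `build_target_batch` and the example pose values are hypothetical, and how the solver orders or filters its outputs is not covered here.

```python
import torch


def build_target_batch(target_poses: torch.Tensor, n_sols: int) -> torch.Tensor:
    """Repeat each pose row n_sols times, keeping rows grouped by pose.

    target_poses has shape (n_poses, 7); the result has shape (n_poses * n_sols, 7).
    """
    n_poses = target_poses.shape[0]
    # expand() adds the per-solution axis as a view; reshape() then materializes the batch.
    return target_poses[:, None, :].expand(n_poses, n_sols, 7).reshape(n_poses * n_sols, 7)


n_sols = 10
# Made-up example poses (position followed by an identity quaternion).
poses = torch.tensor(
    [
        [0.5, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0],
        [0.3, 0.2, 0.4, 1.0, 0.0, 0.0, 0.0],
    ]
)
target_batch = build_target_batch(poses, n_sols)

# pose_index[i] says which input pose row i of target_batch came from.
pose_index = torch.arange(poses.shape[0]).repeat_interleave(n_sols)
```

Keeping `pose_index` alongside the batch makes it possible to split results back out per pose afterwards.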

btw, you can get solutions faster, but at a greater chance of having fewer returned, if you set repeat_counts: Tuple[int] = (1, 3) instead of repeat_counts: Tuple[int] = (1, 3, 10).

Make sure to use expand() (https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html) instead of repeat() (https://pytorch.org/docs/stable/generated/torch.Tensor.repeat.html). repeat() copies the data, which you don't want to do.
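The difference can be checked directly. This is a small sketch with a made-up pose: `expand()` returns a view over the original storage (the expanded dimension has stride 0), while `repeat()` allocates a new tensor.

```python
import torch

# Made-up pose, shape (1, 7).
pose = torch.tensor([[0.5, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0]])

expanded = pose.expand(10, 7)  # view: no new memory for the 10 rows
repeated = pose.repeat(10, 1)  # copy: a fresh (10, 7) tensor

# A view starts at the same memory address as the tensor it was made from.
expand_shares_memory = expanded.data_ptr() == pose.data_ptr()
repeat_shares_memory = repeated.data_ptr() == pose.data_ptr()

if __name__ == "__main__":
    print("expand shares memory:", expand_shares_memory, "stride:", expanded.stride())
    print("repeat shares memory:", repeat_shares_memory, "stride:", repeated.stride())
```

Note that `torch.cat` always allocates its output, so concatenating expanded views still produces one real copy of the final batch; the saving is in not creating intermediate copies first.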

This should be done under the hood. I'll do a refactor once I'm back online.

jstmn avatar May 02 '25 14:05 jstmn

Thank you Jeremy, the answer is of great help!

I am also wondering how you trained the model for ATLAS (2013) - Arm with 6 DOF. I am currently having difficulty training an ikflow model for the UR5e. It seems to me that because its joint limit ranges are large, the accuracy cannot reach a satisfactory level during training. Do you have any ideas?

YupuLu avatar May 18 '25 04:05 YupuLu

Also, I noticed that repeat() is used in _generate_exact_ik_solutions(). I suppose it is possible to replace those calls with expand()?

YupuLu avatar May 18 '25 05:05 YupuLu

I am also wondering how you trained the model for ATLAS (2013) - Arm with 6 DOF. I am currently having difficulty training an ikflow model for the UR5e.

It's expected that ikflow will perform poorly for 6dof robots. That's because there is a small and countable set of individual IK solutions for any given pose (<= 32 I believe), rather than a continuous distribution which makes the learning problem ill posed for a generative model. With 7 dof the solution space changes to a continuous set (assuming the robot's not at a singularity). You're best off using an analytic IK solver like IKFast.

Also, I noticed that repeat() is used in _generate_exact_ik_solutions(). I suppose it is possible to replace those calls with expand()?

Yep, they should be replaced

jstmn avatar May 20 '25 21:05 jstmn

Also, I noticed that repeat() is used in _generate_exact_ik_solutions(). I suppose it is possible to replace those calls with expand()?

Yep, they should be replaced

For reference, I tested _generate_exact_ik_solutions() on our server with an AMD 5975WX and an RTX 4090, using evaluate.py with the panda__full__lp191_5.25m model and a testset_size of 1000.

For 100 solutions_per_pose:

  • repeat(): 8.5154 +/- 0.0 ms for 100 solutions
  • expand(): 8.7507 +/- 0.0001 ms for 100 solutions

For 1000 solutions_per_pose:

  • repeat(): 8.835 +/- 0.0002 ms for 100 solutions
  • expand(): 8.5016 +/- 0.0001 ms for 100 solutions

It's expected that ikflow will perform poorly for 6dof robots. That's because there is a small and countable set of individual IK solutions for any given pose (<= 32 I believe), rather than a continuous distribution which makes the learning problem ill posed for a generative model. With 7 dof the solution space changes to a continuous set (assuming the robot's not at a singularity). You're best off using an analytic IK solver like IKFast.

Indeed, the large joint limit ranges lead to individual IK solutions. Thanks.

YupuLu avatar May 23 '25 04:05 YupuLu

Cool, thanks for doing that timing. Not super surprising, considering how few solutions that is by GPU standards.

Indeed, the large joint limit ranges lead to individual IK solutions. Thanks.

It's a different reason than this, actually! A 6R robot can have a maximum of 16 solutions (at any non-singular pose). A 7R robot will always have infinite solutions (i.e. a continuous solution set) for a given pose (again, ignoring singularities). IKFlow is built on a generative model, which only works for continuous distributions. We treat the IK solution set as a distribution. Therefore, if you use IKFlow for a 6R robot, you'll be trying to learn a generative model for a discrete distribution, which is something it's not designed for.
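A planar analogy may make the discrete versus continuous distinction concrete. This sketch is not the 6R/7R case itself: it uses a 2-link planar arm reaching a 2D point (as many joints as task dimensions, so a finite set of solutions) and a 3-link planar arm reaching the same point (one extra joint, so a one-parameter family). All function names and link lengths are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def two_link_ik(x: float, y: float, l1: float = 1.0, l2: float = 1.0) -> List[Tuple[float, float]]:
    """Closed-form IK for a 2-link planar arm: at most two solutions (elbow down/up)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return []  # target out of reach
    solutions = []
    for q2 in (math.acos(c2), -math.acos(c2)):
        q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
        solutions.append((q1, q2))
    return solutions


def three_link_ik(x: float, y: float, phi: float, l1: float = 1.0, l2: float = 1.0,
                  l3: float = 0.5) -> List[Tuple[float, float, float]]:
    """IK for a 3-link planar arm reaching (x, y) with the last link at angle phi.

    phi is free when only the position matters, so sweeping it traces out a
    continuous family of joint configurations for the same target point.
    """
    wrist_x = x - l3 * math.cos(phi)
    wrist_y = y - l3 * math.sin(phi)
    return [(q1, q2, phi - q1 - q2) for q1, q2 in two_link_ik(wrist_x, wrist_y, l1, l2)]


def forward(joints: Sequence[float], lengths: Sequence[float]) -> Tuple[float, float]:
    """Planar forward kinematics: end-effector position for the given joint angles."""
    x = y = angle = 0.0
    for q, length in zip(joints, lengths):
        angle += q
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


if __name__ == "__main__":
    print("2-link solutions:", two_link_ik(1.0, 0.5))
    for phi in (0.0, 0.5, 1.0):
        print(f"3-link solutions at phi={phi}:", three_link_ik(1.0, 0.5, phi))
```

Every value of phi gives valid 3-link solutions for the same point, whereas the 2-link arm only ever has its two isolated configurations; a sampler over a continuous distribution has something to model in the first case and not in the second.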

jstmn avatar May 23 '25 22:05 jstmn

btw, I figured out why strange memory usage appears when I use a machine with multiple graphics cards. It seems that PyTorch pre-allocates a chunk of memory (about 300 MiB) on a GPU as soon as we create a tensor there, even just one. Even though we set the default device in config.py, several tensors created with device='cuda', like _TORCH_EPS_CUDA in math_utils.py, lead to memory usage on other GPUs. The result looks like this:

| 0 N/A N/A 3966189 C python 386MiB |
| 1 N/A N/A 3966189 C python 384MiB |
| 2 N/A N/A 3966189 C python 3050MiB |

So I replaced all of those 'cuda' strings with DEVICE imported from jrl.config, and everything works fine now. I may do a bit more testing later.
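The pattern behind that fix can be sketched as follows. The `DEVICE` constant below is only a stand-in for the one in jrl.config (its real definition is not shown in this thread), and `eps` with its value is a hypothetical example tensor. The sketch picks the last visible GPU to mimic a non-default device and falls back to the CPU when no GPU is present.

```python
import torch

# Stand-in for jrl.config.DEVICE (assumption: a single, centrally chosen torch.device).
if torch.cuda.is_available():
    DEVICE = torch.device(f"cuda:{torch.cuda.device_count() - 1}")
else:
    DEVICE = torch.device("cpu")

# A bare "cuda" string targets the current CUDA device (index 0 unless changed),
# regardless of the device configured above:
#     eps = torch.tensor(1e-8, device="cuda")
# Passing the shared constant keeps every module-level tensor on the chosen device.
eps = torch.tensor(1e-8, device=DEVICE)

if __name__ == "__main__":
    print("configured device:", DEVICE, "| tensor device:", eps.device)
```

The small fixed allocations in the listing above are consistent with PyTorch initializing a CUDA context on each device it touches, which is why even a single scalar tensor on the wrong GPU shows up as several hundred MiB.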

YupuLu avatar May 30 '25 07:05 YupuLu

Thanks for the heads up. I just edited _TORCH_EPS_CUDA to use DEVICE. This should be fixed in the most recent jrl and ikflow versions.

jstmn avatar Jun 11 '25 01:06 jstmn