
The training results do not meet expectations

Open weishen-kit opened this issue 7 months ago • 3 comments

Hello, I tried to use ikflow to train a model for a custom 8-DOF robotic arm, but the results did not meet my expectations: the average position and rotation errors are large.

solutions clamped to joint limits: True
non-self-colliding testset: True
solutions refined: False

Average positional error: 45.7583 mm
Average rotational error: 9.4939 deg
Percent joint limits exceeded: 0.0 %
Percent self-colliding: 3.82 %
Average runtime: 5.6242 +/- 0.0003 ms for 100 solutions, 0.0562 ms per solution

My training set size is 1.3M, the learning rate is 1e-4, nb_nodes is 12, the batch_size is 2048, and I am using an RTX 3090. Can you provide any references or suggestions? Looking forward to your reply.


weishen-kit, Jun 06 '25 11:06

Hey, here's what I'd recommend:

  1. Increase the number of coupling layers to 14 (--nb_nodes=14).
  2. Then increase the learning rate until you notice training instability. Specifically, try 1e-4, 1.125e-4, 1.25e-4, 1.375e-4, and so on until the loss/metrics 'explode'.
  3. Also, you're gonna need a lot more training steps, on the order of 3-8 million for a final model. See this comment
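The learning-rate sweep in step 2 can be sketched as below. This is only an illustration: the helper names, the `margin` threshold, and the idea of flagging a run once its loss jumps well above its best value are assumptions made here, not part of ikflow.

```python
import math


def learning_rate_sweep(start=1e-4, step=1.25e-5, count=4):
    """Return `count` learning rates beginning at `start`, spaced by `step`."""
    return [start + i * step for i in range(count)]


def first_unstable(losses_by_lr, margin=10.0):
    """Return the smallest learning rate whose loss history 'explodes', else None.

    Hypothetical heuristic: a run counts as unstable if any loss is NaN/inf,
    or if it rises more than `margin` above the best loss seen so far in that
    run. An additive margin is used because a likelihood-based loss can be
    negative.
    """
    for lr in sorted(losses_by_lr):
        best = math.inf
        for loss in losses_by_lr[lr]:
            if not math.isfinite(loss) or loss - best > margin:
                return lr
            best = min(best, loss)
    return None


if __name__ == "__main__":
    for lr in learning_rate_sweep():
        print(f"{lr:.4e}")
```

Each value would be used for a separate training run, and the largest rate that stays stable is the one to keep.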

What are the joint limits for each joint?

Also, check out Figure 8 in the IKFlow paper to get a sense of how many training steps you'll need.

jstmn, Jun 09 '25 02:06


I apologize for the delayed response. During this time, I’ve made the following attempts:

  1. Increased nb_nodes from 12 to 14 and experimented with different learning rates. The current setup is relatively stable at 7.5e-5, while learning rates of 1e-4 or higher tend to cause gradient explosion.

  2. The current parameters are for an 8-DOF robotic arm with a joint limit sum of 36. I modified the collision detection to always return False. The training set size is 2.5 million, and the validation set size is 1,000 (is this too small? The default was 500). The batch_size is 512, and the initial learning rate is 7.5e-5. All other parameters remain default. After 9 million steps, the L2 error appears to stabilize around 3.5 cm. I will continue monitoring the progress.
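To put the step count in context, it can help to convert steps into passes over the training set, and to work out how many steps a different batch_size would need to see the same number of samples. The sketch below assumes that one training step consumes exactly one batch; the function names are made up for illustration.

```python
def epochs_seen(steps, batch_size, dataset_size):
    """Number of passes over the dataset, assuming one batch per step."""
    return steps * batch_size / dataset_size


def steps_for_same_samples(steps, old_batch_size, new_batch_size):
    """Steps needed at `new_batch_size` to process as many samples as before."""
    return steps * old_batch_size // new_batch_size


if __name__ == "__main__":
    print(epochs_seen(9_000_000, 512, 2_500_000))
    for new_bs in (1024, 2048):
        print(new_bs, steps_for_same_samples(9_000_000, 512, new_bs))
```

With the numbers above, 9 million steps at batch_size 512 is roughly 1,843 passes over the 2.5M-sample training set; seeing the same number of samples would take 4.5 million steps at batch_size 1024 or 2.25 million at 2048.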

Additionally, I’d like to ask the following questions and propose some experiments:

  1. The current arm primarily uses rotational joints. Would it be feasible to add a prismatic joint, increasing the DOF to 9?

  2. Could we try increasing the batch_size to 1024 or 2048 for comparison?

  3. Is the default validation set size too small?

  4. In some cases, my robot may operate near joint limits or zero positions. Would adjusting the test set generation code to oversample these regions improve training?
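For question 4, one possible way to oversample is to draw each joint value from a mixture: mostly uniform over the joint range, with some probability of landing in a narrow band next to a limit or around zero. The sketch below is standalone and does not use ikflow's dataset generation code; `p_edge`, `band`, and both function names are hypothetical, and whether this helps training is untested.

```python
import random


def sample_joint(lower, upper, rng, p_edge=0.3, band=0.05):
    """Sample one joint value, oversampling near the limits and around zero.

    With probability `p_edge` the value comes from a band whose width is
    `band` times the joint range; otherwise it is uniform over the range.
    """
    width = band * (upper - lower)
    if rng.random() < p_edge:
        # Bands just inside each limit, plus one around zero if zero is in range.
        regions = [(lower, lower + width), (upper - width, upper)]
        if lower <= 0.0 <= upper:
            regions.append((max(lower, -width / 2), min(upper, width / 2)))
        lo, hi = rng.choice(regions)
        value = rng.uniform(lo, hi)
    else:
        value = rng.uniform(lower, upper)
    # Clamp to guard against floating-point rounding at the edges.
    return min(max(value, lower), upper)


def sample_configs(joint_limits, n, seed=0, p_edge=0.3, band=0.05):
    """Return `n` joint configurations for a list of (lower, upper) limits."""
    rng = random.Random(seed)
    return [
        [sample_joint(lo, hi, rng, p_edge, band) for lo, hi in joint_limits]
        for _ in range(n)
    ]


if __name__ == "__main__":
    limits = [(-3.14, 3.14), (0.0, 0.5)]  # example limits, not from the robot above
    for q in sample_configs(limits, 3):
        print(q)
```

Setting `p_edge=0.0` recovers plain uniform sampling, which makes it easy to compare the two datasets under otherwise identical training settings.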

Honestly, your work is truly outstanding! I’d greatly appreciate your feedback and guidance.

weishen-kit, Jun 17 '25 03:06

Seems like you've made some nice progress!

If you want to aim for even better performance you can increase the number of nodes further, but fair warning: I haven't gone beyond 14. Honestly though, the ~3 cm positional error you're seeing is going to be more than enough accuracy for rapid refinement with traditional IK. That functionality is provided natively with IKFlowSolver.generate_exact_ik_solutions, btw. How many solutions are you going to be querying for at once?
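To illustrate why a centimeter-scale seed is good enough, here is a toy sketch of refining an approximate solution with a traditional iterative IK step (damped least squares). It is not ikflow's implementation: it uses a made-up 2-link planar arm, position error only, and invented link lengths and parameter names.

```python
import math

L1, L2 = 0.5, 0.4  # link lengths in meters (invented for this example)


def fk(q):
    """End-effector position of a 2-link planar arm."""
    return (
        L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1]),
        L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1]),
    )


def jacobian(q):
    """2x2 position Jacobian, returned as rows ((a, b), (c, d))."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-L1 * s1 - L2 * s12, -L2 * s12), (L1 * c1 + L2 * c12, L2 * c12))


def refine(q_seed, target, tol=1e-6, max_iters=50, damping=1e-4):
    """Refine a seed with damped least squares: dq = J^T (J J^T + lambda I)^-1 e."""
    q = list(q_seed)
    for _ in range(max_iters):
        x, y = fk(q)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break
        (a, b), (c, d) = jacobian(q)
        # Entries of the symmetric matrix J J^T + lambda I.
        m00 = a * a + b * b + damping
        m01 = a * c + b * d
        m11 = c * c + d * d + damping
        det = m00 * m11 - m01 * m01
        # Solve (J J^T + lambda I) u = e with the closed-form 2x2 inverse.
        u0 = (m11 * ex - m01 * ey) / det
        u1 = (-m01 * ex + m00 * ey) / det
        # Joint update dq = J^T u.
        q[0] += a * u0 + c * u1
        q[1] += b * u0 + d * u1
    return q


def position_error(q, target):
    x, y = fk(q)
    return math.hypot(target[0] - x, target[1] - y)


if __name__ == "__main__":
    target = fk([0.6, 0.9])
    seed = [0.63, 0.87]  # stands in for an approximate learned solution
    print(position_error(seed, target), position_error(refine(seed, target), target))
```

Because the seed already sits close to a true solution, the iteration typically needs only a handful of steps, which is why a learned model with a few centimeters of error pairs well with a refinement pass.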

  1. Yes, you can add a prismatic joint; that should be fine. Any joint limit range over the real numbers should work.
  2. Yes, please do. In general I found that increasing the batch size only improved model accuracy.
  3. On oversampling near joint limits: I'd imagine oversampling regions of cspace will result in the model biasing its solutions towards those regions, but I have never tested that. Let me know what you find out if you try this yourself!

Thanks for the kind note! Good luck, let me know how it goes.

jstmn, Jun 22 '25 21:06