Inquiry about training and installation
Hi Jeremy,
Thank you for providing the official implementation code! Some confusing points came up while I was applying the code, and I am wondering if you can help me.
- I am new to poetry and am using Ubuntu 20.04. When solving the dependencies, it always tries to install klampt = "0.9.1.post6" and returns an error since there is no such version. I have no idea why this happens, since it seems to me that the version is pinned to "0.9.1.post5" in the jrl project.
- How long will training take for the panda robot, for example with "python scripts/train.py --robot_name=panda --nb_nodes=12 --batch_size=128 --learning_rate=0.0005"? The paper says training was done with an NVIDIA GeForce RTX 2080 Ti graphics card. For me one epoch took several hours, but the max epoch is set to 5000. How is it possible to finish a training run? Is there anything that I missed?
I would greatly appreciate hearing from you soon.
Best regards, Yupu
Hi Yupu,
- I'm not sure what's causing that. I just bumped klampt to 0.9.2 in Jrl, can you try a fresh install?
- The max-epoch won't be reached! Training is designed to only stop manually. For the final trained models I let training run for several days. Check out this comment (and the whole thread) to see what an expected time vs. pose error plot should look like: https://github.com/jstmn/ikflow/issues/6#issuecomment-1983985747
Also, if you're training a panda robot, use these parameters: `--robot_name=panda --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=7 --batch_size=512 --learning_rate=0.00005 --gradient_clip_val=1 --dataset_tags non-self-colliding`
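Putting those together with the training script from above, the full command looks something like this:

```
python scripts/train.py --robot_name=panda --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=7 --batch_size=512 --learning_rate=0.00005 --gradient_clip_val=1 --dataset_tags non-self-colliding
```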
Good luck! Let me know if you have any other issues.
- jeremy
And just to be clear, there are pretrained models you can use: `python scripts/evaluate.py --testset_size=500 --model_name=panda__full__lp191_5.25m`, for example. All models can be found here: https://github.com/jstmn/ikflow/blob/master/ikflow/model_descriptions.yaml
Thank you for your reply! It really helps clarify my confusion. I will try the installation and test whether everything works.
Still, I am wondering if the installation requirements can be relaxed, such as the versions of python (only 3.8) and pytorch (2.3). Will it work with pytorch 2.0 or python 3.9? If so, it will be easier to integrate with many other projects for extension.
Also, I am somewhat new to robotic manipulators. If I want to use a learned model with jrl in another application (for instance, using pybullet) for the same robot, such as the franka panda, is there anything that I should be aware of? Thanks in advance!
I managed to make it work with python 3.9 and pytorch 2.0.1. I am still not sure what will happen and will report anything if it turns out to be valuable.
Hi @YupuLu ,
Great, sounds like you got it working. Right now only python 3.8 is allowed because it would be extra work to ensure the code works on other python versions. I would guess the code should work fine on later python versions too. I think pytorch just needs to be > 2.0, because that's when setting the default dtype and device was introduced.
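For reference, the 2.0+ feature I mean is the global default device/dtype, along these lines (a minimal sketch, not taken from the ikflow code):

```python
import torch

# torch.set_default_device() was added in PyTorch 2.0; together with
# set_default_dtype() it lets newly created tensors land on the chosen
# device/dtype without an explicit .to() call.
torch.set_default_device("cuda:0")
torch.set_default_dtype(torch.float32)

x = torch.zeros(3)
print(x.device, x.dtype)  # cuda:0 torch.float32
```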
Did you do it by editing pyproject.toml? If so can you post it in this thread so others can see.
"If I want to utilize a learning model with jrl to other application (for instance, using pybullet) for the same robot like franka panda, is there anything that I should be aware of?"
The thing you need to check is whether the urdfs are the same. To ensure they are, you can use the urdf used by IKFlow, which will be stored at ~/.cache/jrl/temp_urdfs/. Otherwise, you'll need to verify that the pybullet and ikflow urdfs are identical.
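If you end up with two separate files, a byte-level comparison is enough to confirm they match. Something like this works (the file names below are just placeholders; adjust them to whatever is actually in ~/.cache/jrl/temp_urdfs/ and wherever pybullet loads its urdf from):

```python
import hashlib
from pathlib import Path

def urdf_digest(path: Path) -> str:
    """SHA-256 digest of a urdf file so two copies can be compared byte-for-byte."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Placeholder paths: the jrl/ikflow copy lives under ~/.cache/jrl/temp_urdfs/,
# the second path is wherever your pybullet simulation loads its urdf from.
ikflow_urdf = Path.home() / ".cache/jrl/temp_urdfs/panda.urdf"
pybullet_urdf = Path("path/to/your/pybullet/panda.urdf")

if urdf_digest(ikflow_urdf) == urdf_digest(pybullet_urdf):
    print("urdfs are byte-identical")
else:
    print("urdfs differ -- check joint origins, limits, and link frames")
```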
Once you get ikflow working with pybullet, can you share the steps required in this thread? I'm curious to hear myself, and it will be helpful for others.
Hi Jeremy @jstmn,
Edit 1: I checked the package versions and the version of torch is still 2.4.0...
Edit 2: I tested multiple times and found a somewhat convoluted way to install torch==2.0.1. I have no idea why '--no-update' did not work: when I used poetry lock --no-update, poetry kept updating torch to 2.4.0, so I just commented out all the lines related to torch.
I am still quite unfamiliar with poetry, so I am not sure exactly what I did and why it worked. But here are my installation steps:
- Create a new conda environment with python 3.9.
- Clone the jrl project, delete the file poetry.lock, and then modify the pyproject.toml (change the python version requirement to `python = "^3.8.0"` and comment out the line `torch = "2.3"`).
- Run (maybe `poetry lock --no-update` first) `poetry install` to install jrl.
- It seems that sometimes poetry doesn't work well with torch 2.0.1 (if you already have torch installed, an error is returned when you import jrl after the installation). I just reinstalled torch with pip locally to fix this, since I had already downloaded the package before.
- Clone this ikflow project, delete the file poetry.lock, and then modify the pyproject.toml (change the python version requirement to `python = "^3.8.0"` and comment out the lines `FrEIA = "0.2"`, `jrl = ...`, and `pytorch-lightning = "1.8.6"`).
- Run (maybe `poetry lock --no-update` first) `poetry install` to install ikflow.
- Install FrEIA, pytorch-lightning, and pytorch using pip (a quick sanity check of the result is sketched below): `pip install FrEIA==0.2 pytorch-lightning==1.8.6 torch==2.0.1+cu117`
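After these steps, a minimal sanity check I run is just importing everything and printing the torch version (nothing project-specific):

```python
# Quick environment check after the install steps above.
import torch
import jrl
import ikflow

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
```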
"Did you do it by editing pyproject.toml? If so can you post it in this thread so others can see."
I am developing my project and will test whether everything works fine or not.
Thank you for your suggestions. I haven't tried such things before and it may take time for me to finish the verification. Wish me good luck :)
"The thing you need to check is whether the urdfs are the same. To ensure they are, you can use the urdf used by IKFlow, which will be stored at ~/.cache/jrl/temp_urdfs/. Otherwise, you'll need to verify that the pybullet and ikflow urdfs are identical."
Hi Jeremy @jstmn ,
I noticed that the data loading is not totally consistent. During training, some resources related to the robot model are always loaded onto "cuda:0". This problem can be reproduced when I call get_robot('panda') with DEVICE='cuda:3' in jrl.config.py. Here is the relevant nvidia-smi output:
```
| 0 N/A N/A 2527243 C python 510MiB |
| 0 N/A N/A 2527450 C python 510MiB |
| 3 N/A N/A 2527243 C python 3456MiB |
| 3 N/A N/A 2527450 C python 3456MiB |
```
Sounds like DEVICE from jrl/config.py isn't being used everywhere. Which variables specifically are on the wrong cuda device?
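Something quick like this should show it (assuming the offending tensors are plain attributes on the robot object; adjust if they're nested):

```python
import torch
from jrl.robots import get_robot

robot = get_robot("panda")
# Print the device of every top-level tensor attribute so we can see
# which ones ignore the DEVICE setting.
for name, value in vars(robot).items():
    if isinstance(value, torch.Tensor):
        print(f"{name}: {value.device}")
```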
Well, I did a simple test just now; here is the script I used with device='cuda:3':
```python
from jrl.robots import get_robot
import time

if __name__ == "__main__":
    time.sleep(1000)
```
As long as I include the first line, the problem happens. Even if I comment out all the contents of jrl.robots.py except the function get_robot(), the 510 MiB of memory on cuda:0 is still occupied. So I suppose the fault is not related to the variables in the jrl project but rather has something to do with the installation?
BTW, would you mind providing the negative log likelihood curve during training for reference, like in the post you mentioned before?
What's the actual error you're getting? Can you include the stack trace?
Sure, here's the curve:
"What's the actual error you're getting? Can you include the stack trace?"
Actually, there was no error. To put it simply, I opened Python in a terminal, ran import jrl, and monitored with nvidia-smi in another session; there was 510 MiB of usage on gpu 0. But I am confused about why the import alone leads to GPU usage.
I noticed output like "Warp 0.10.1 initialized ... CUDA Toolkit: ... Devices: ... Kernel cache: ..." when importing jrl. It seems to me that this step takes up the GPU memory, so I suppose it has nothing to do with the package itself?
It could be from the forward-kinematics cache operation done here: https://github.com/jstmn/jrl/blob/master/jrl/robot.py#L236
The '"Warp 0.10.1 initialized.....CUDA Toolkit: ...Devices:...Kernel cache:..." ' happens whenever you call import warp, so that's probably not it.
Hi Jeremy @jstmn,
About the Jrl library: if I want to add new robots to it, for example the ur3, what are the correct steps to do that? Here is my plan based on my understanding:
- Implement the class Ur3() in `robots.py`
- Download related urdf files from some project like urdf_files_dataset, or generate them using ROS based on your README.md
- Generate the capsule folder using `calculate_capsule_approximation.py`
Are these steps enough? Should I use `calculate_ignorable_link_collision_pairs.py` and `calculate_rotational_repeatability.py`?
Yep! That looks like it.
"Should I use `calculate_ignorable_link_collision_pairs.py` and `calculate_rotational_repeatability.py`?"
Yes, run calculate_ignorable_link_collision_pairs.py and save the output at the top of robots.py, like is done here:
```python
RIZON4_ALWAYS_COLLIDING_LINKS = []
RIZON4_NEVER_COLLIDING_LINKS = [...]

# in __init__:
ignored_collision_pairs = RIZON4_NEVER_COLLIDING_LINKS + RIZON4_ALWAYS_COLLIDING_LINKS
Robot.__init__(
    self,
    Rizon4.name,
    urdf_filepath,
    active_joints,
    base_link,
    end_effector_link_name,
    ignored_collision_pairs,
    collision_capsules_by_link,
    verbose=verbose,
    additional_link_name=None,
)
```
No need to run `calculate_rotational_repeatability.py`; just use `ROTATIONAL_REPEATABILITY_DEG = 0.1`
Also, can you do me a favor and open a new issue with this same question? It'll be easier for others to find this info in the future.
Thanks
Sure, I raised the issue in the jrl project.
BTW, I have made some modifications to the jrl code: fixing bugs in calculate_capsule_approximation.py, adding the Ur3 model, and completing the collision_capsules_by_link for the iiwa robots. Can I open a pull request for your reference?
Thanks. "Can I add a pull request for your reference?" yep! thanks
"Once you get ikflow working with pybullet, can you share the steps required in this thread? I'm curious to hear myself, and it will be helpful for others."
Hi Jeremy @jstmn,
I forgot to tell you that I have tested some robot models using mujoco. For franka and iiwa7 it works fine; at least their kinematics are the same. But for the UR5e, the urdf file we use here is slightly different from the old version used by google_deepmind's mujoco_menagerie, with some parameters changed, for example from 0.392 to 0.3922 (ours). As a result, the end-effector pose varies slightly.
It is not that easy for me to convert the urdf file into the xml format used by mujoco. But I found an older urdf version in another project, pybullet_ur5_gripper, which works fine with theirs. So I assume our ikflow models can work in a pybullet env as well.
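For anyone else checking urdf consistency between two robot descriptions, a simple way to spot parameter differences like the 0.392 vs. 0.3922 one is something along these lines (plain standard-library parsing; the file names are just placeholders):

```python
import xml.etree.ElementTree as ET

def joint_origins(urdf_path):
    """Map joint name -> origin xyz string as declared in the urdf."""
    root = ET.parse(urdf_path).getroot()
    return {
        j.get("name"): (j.find("origin").get("xyz") if j.find("origin") is not None else None)
        for j in root.iter("joint")
    }

# Placeholder file names for the two UR5e descriptions being compared.
ours = joint_origins("ur5e_jrl.urdf")
theirs = joint_origins("ur5e_menagerie.urdf")
for name in sorted(set(ours) | set(theirs)):
    if ours.get(name) != theirs.get(name):
        print(f"{name}: {ours.get(name)} vs {theirs.get(name)}")
```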