diffusion_policy
Question: virtual environment rendering/acceleration
Hi there! Thanks for your impressive work and beautiful code :) I tried to run lift_image_abs with the transformer hybrid workspace HEADLESS, but it logged:
[root][INFO] Command '['/mambaforge/envs/robodiff/lib/python3.9/site-packages/egl_probe/build/test_device', '0']' returned non-zero exit status 1.
[root][INFO] - Device 0 is not available for rendering
and this repeats for all 4 GPUs. Afterwards, I found the "Eval LiftImage" process is really slow. Should I enable or install some driver for hardware acceleration?
nvidia-smi output during eval (GPU-Util stays at 0%):
top output during eval:
wandb monitor data:
Hi @AlbertTan404, in my experience the eval process is CPU bound, therefore I'm surprised to find low CPU usage on your system during eval. I don't have experience dealing with this problem, but I suspect most of the time is spent inside the robomimic environments.
Thanks for your reply. I'll look into the inference process in the robomimic env.
Hi @AlbertTan404, I recently encountered a similar issue on my machine as well. It turns out to be a bug in recent versions of pytorch when installed through conda:
https://github.com/pytorch/pytorch/issues/99625
This bug causes all subprocesses created after import torch to inherit a CPU affinity pinned to the first CPU core, which squeezes all dataloader workers and robomimic env workers onto the same core, drastically decreasing performance.
As described in the pytorch issue, the solution is:
conda install llvm-openmp=14
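If pinning llvm-openmp is not immediately possible, one workaround I can sketch (my own suggestion, not part of this repo; reset_cpu_affinity is a name I made up, and os.sched_setaffinity is Linux-only) is to reset the process's CPU affinity after import torch and before spawning any workers, so that forked subprocesses inherit the full core set again:

```python
import os

def reset_cpu_affinity():
    # Linux-only: restore this process's affinity to all cores, undoing
    # the single-core pinning caused by the llvm-openmp bug. Subprocesses
    # forked afterwards inherit this full affinity mask.
    if hasattr(os, 'sched_setaffinity'):
        os.sched_setaffinity(0, range(os.cpu_count()))  # 0 = current process

# Call this right after `import torch`, before creating dataloaders/envs.
reset_cpu_affinity()
```

This only masks the symptom for the current process tree; the conda llvm-openmp fix above is the proper solution.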
You can check if you are affected by running this script:
import multiprocessing as mp

def print_affinity():
    import psutil
    p = psutil.Process()
    print('before import torch', p.cpu_affinity())

p = mp.Process(target=print_affinity)
p.start()
p.join()

import torch

def print_affinity():
    import psutil
    p = psutil.Process()
    print('after import torch', p.cpu_affinity())

p = mp.Process(target=print_affinity)
p.start()
p.join()
This is the result on my machine before and after the fix:
I will pin llvm-openmp version in this repo as well.
Great, thanks! I found it significantly boosts the evaluation process.
Btw, conda install llvm-openmp=14 takes a long time on my machine, while mamba install llvm-openmp=14 works much faster.
@AlbertTan404 Great! I want to keep this issue open so that other people can find it as well.