multinerf

All clues welcome!

Open MikePelton opened this issue 1 year ago • 2 comments

Hi - trying to run either the 360 or raw scripts (with the paths suitably edited) leaves me in an endless loop, as shown below. I take the JAX messages not to be errors (I get the same running with CPU or GPU JAX), but the code then drops into an endless cycle (xxx replaces a real path on my machine). I have no idea how to fix this. What might I be doing wrong, please?

bash scripts/eval_raw_mjp.sh

I0915 15:39:12.619821 140644963370176 xla_bridge.py:350] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0915 15:39:12.695000 140644963370176 xla_bridge.py:350] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host Interpreter CUDA
I0915 15:39:12.695568 140644963370176 xla_bridge.py:350] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0915 15:40:46.779565 140644963370176 checkpoints.py:466] Found no checkpoint files in xxx/rawnerf/nerf_results/raw/candle with prefix checkpoint_
Checkpoint step 0 <= last step 0, sleeping.
I0915 15:40:56.790609 140644963370176 checkpoints.py:466] Found no checkpoint files in xxx/rawnerf/nerf_results/raw/candle with prefix checkpoint_
Checkpoint step 0 <= last step 0, sleeping.
...and so on ad infinitum....

MikePelton • Sep 15 '22 14:09

It looks like the code is trying to use a TPU, but can't find it. You probably want to use your GPU. I'd first verify that your JAX installation can run things on the GPU and then circle back to this codebase.
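A quick way to do that check, as a minimal sketch: the helper name `describe_jax_devices` is made up for illustration, while `jax.default_backend()` and `jax.devices()` are standard JAX calls. On a working GPU install one would expect the backend to be reported as a GPU platform rather than `cpu`, though the exact device strings vary by JAX version.

```python
def describe_jax_devices():
    """Return (default_backend, [device strings]), or None if JAX is not importable.

    Hypothetical helper for illustration; not part of multinerf.
    """
    try:
        import jax
    except ImportError:
        return None
    # default_backend() names the platform JAX will use by default.
    # devices() lists the devices visible on that platform.
    return jax.default_backend(), [str(d) for d in jax.devices()]


info = describe_jax_devices()
if info is None:
    print("JAX is not installed in this environment")
else:
    backend, devices = info
    print(f"default backend: {backend}")
    print(f"devices: {devices}")
```

If this reports `cpu` on a machine with a GPU, the likely culprit is a CPU-only `jaxlib` build rather than anything in this codebase.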

It also looks like you're running an eval script. Have you run a train script beforehand? The eval script needs some checkpoints to evaluate.
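To confirm whether any checkpoints exist before launching an eval script, something like the sketch below may help. It only mirrors what the log message describes (looking for entries with the prefix `checkpoint_` in the results directory); the function name `find_checkpoints` is hypothetical, and the demo uses a temporary directory rather than a real multinerf output path.

```python
import os
import tempfile


def find_checkpoints(checkpoint_dir, prefix="checkpoint_"):
    """Return sorted entry names in checkpoint_dir that start with prefix.

    Hypothetical helper; the prefix is taken from the log message above.
    """
    if not os.path.isdir(checkpoint_dir):
        return []
    return sorted(n for n in os.listdir(checkpoint_dir) if n.startswith(prefix))


# Demo on a throwaway directory: empty before "training", one entry after.
with tempfile.TemporaryDirectory() as tmp:
    before = find_checkpoints(tmp)
    open(os.path.join(tmp, "checkpoint_1000"), "w").close()
    after = find_checkpoints(tmp)

print(before)
print(after)
```

An empty result for the real results directory means the eval script has nothing to load yet and will keep sleeping, as in the log.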

jonbarron • Sep 15 '22 15:09

Hi Jon - thanks for coming back so quickly. As expected, I was doing something dumb: I will run a training script first! Re JAX and the TPU: it seems that unless JAX is told explicitly which devices to look for, it will warn about a missing TPU etc., but that is not actually an error. I've run with both the GPU and CPU JAX versions and the unit tests go okay (bar the failures others have reported here). Presumably you guys are running on kit that has a TPU, so you don't see that message? Will loop back once I've run the training. If it's really flagging a problem, I'm sure your colleagues on the JAX side of the house will be able to give us a steer.

MikePelton • Sep 15 '22 16:09
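On the point about telling JAX explicitly which devices to look for: the `I` prefix on the `xla_bridge.py` lines marks them as INFO-level messages, which fits the reading that they are not errors. A minimal sketch of restricting the platforms JAX tries, assuming a JAX release that honors the `JAX_PLATFORMS` environment variable (older releases used `JAX_PLATFORM_NAME`), is below. The value `cpu` is used only so the sketch runs anywhere; on an NVIDIA machine a value such as `cuda` would be the likely choice, which is an assumption not tested here. Whether this silences the messages in a given version is also not guaranteed.

```python
import os

# Must be set before `import jax`, since backends are chosen at initialization.
# setdefault leaves any value already exported in the shell untouched.
os.environ.setdefault("JAX_PLATFORMS", "cpu")

try:
    import jax
    backend = jax.default_backend()
except ImportError:
    backend = None  # JAX is not installed in this environment

print(f"JAX_PLATFORMS={os.environ['JAX_PLATFORMS']}, backend={backend}")
```

The same effect can be had without touching any code by exporting the variable in the shell before running the script.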