RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

Open lukaszkn opened this issue 2 years ago • 3 comments

Any idea how to fix the error below? It happens for every sample game.

ray==1.5.0 torch==1.9.1+cu111

Thanks

2021-09-24 17:22:36,982 ERROR worker.py:79 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Reanalyse.reanalyse() (pid=11872, ip=192.168.0.107)
  File "python\ray\_raylet.pyx", line 534, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 484, in ray._raylet.execute_task.function_executor
  File "lib\site-packages\ray\_private\function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "muzero-general\replay_buffer.py", line 350, in reanalyse
    self.model.initial_inference(observations)[0],
  File "muzero-general\models.py", line 173, in initial_inference
    encoded_state = self.representation(observation)
  File "muzero-general\models.py", line 135, in representation
    observation.view(observation.shape[0], -1)
  File "lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "lib\site-packages\torch\nn\parallel\data_parallel.py", line 156, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

lukaszkn avatar Sep 24 '21 15:09 lukaszkn

Disclaimer: I've only been playing around with this for a couple hours and all of this is fairly new to me, but I hope others may find it useful.

Short version:

Set reanalyse_on_gpu on MuZeroConfig to True (or to torch.cuda.is_available()).

This avoids the error because the reanalyse stage no longer tries to use the CUDA buffers from the CPU.

Longer version

You seem to be using Windows, given that the file paths use backslashes (\).

Windows support is considered experimental. As far as I can tell, this is due to Ray's Windows support being incomplete (ray-project/ray#199). That is a top-level issue for tracking Windows support, and despite it being closed, support is still incomplete.

I was able to run the sample (the Connect4 game) in CPU mode, because the PyTorch build I had installed didn't have CUDA/GPU support, so the GPU was ignored. Since installing PyTorch with CUDA 11, I've been getting the same error as you, and I also get it when running the sample from issue #66.

For reference my versions are ray 1.7.1, torch 1.10.0+cu113.

I came across a way to fix the error; however, I am not sure what consequences it has. Essentially, the change I made was to set reanalyse_on_gpu to True, as it would seem that the data for the reanalyse stage is being processed on the CPU while the parameters/buffers are on the GPU. As mentioned above, I am brand new to Ray and PyTorch. I suspect a better solution might be to transfer the data from the CUDA device to the CPU device when transitioning to the reanalyse stage. Again, I have no idea whether that is optimal or even a good idea.
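
To illustrate the general constraint behind the error message, here is a generic PyTorch sketch (not code from muzero-general): a module's parameters and the tensors passed to it need to live on the same device. The pick_device helper and the tiny Linear model are hypothetical stand-ins, and the sketch falls back to the CPU when CUDA is not available.

```python
import torch


def pick_device(use_gpu: bool) -> torch.device:
    """Hypothetical helper: use CUDA only if requested and actually available."""
    if use_gpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device(use_gpu=True)

# Stand-in model; moving it with .to(device) places its parameters on that device.
model = torch.nn.Linear(4, 2).to(device)

# Create the input on the same device as the model before calling it.
observation = torch.zeros(1, 4, device=device)
output = model(observation)
```

If the model and the input ended up on different devices, the forward call would raise a RuntimeError about mismatched devices, which is the same family of problem as the one reported here.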

So far I have only seen between 1 s/it and almost 2 s/it when using the GPU.

donno avatar Oct 23 '21 05:10 donno

Same problem. I have tried this, but it's not working; it still returns the error:

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

version:

ray: 1.13.0 torch: 1.8.2+cu111

digits122 avatar Jul 31 '22 18:07 digits122

I have bypassed this problem rather than really solved it. For example, if you want to run connect4, open connect4.py, find def __init__(self):

and change this line

self.reanalyse_on_gpu = False

to the following

self.reanalyse_on_gpu = True
self.train_on_gpu = True
self.selfplay_on_gpu = True

and it works fine.

If you want to play another game, edit that game's .py file and add these three settings. They force everything to run on the GPU, so there won't be any CPU/GPU mismatch.
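
As a rough sketch of the same workaround applied in one place, the three flags can be set together on any config object. The force_gpu_flags helper and the SimpleNamespace stand-in config are hypothetical and only for illustration; the flag names are the ones listed above. In a real run you would probably pass torch.cuda.is_available() as use_gpu instead of a hard-coded True.

```python
from types import SimpleNamespace

# The three settings named in the workaround above.
GPU_FLAGS = ("train_on_gpu", "selfplay_on_gpu", "reanalyse_on_gpu")


def force_gpu_flags(config, use_gpu: bool):
    """Hypothetical helper: set all three device flags to the same value."""
    for name in GPU_FLAGS:
        setattr(config, name, use_gpu)
    return config


# Stand-in for a game's config object, starting with mixed settings.
config = SimpleNamespace(
    train_on_gpu=True,
    selfplay_on_gpu=False,
    reanalyse_on_gpu=False,
)

force_gpu_flags(config, use_gpu=True)
```

Keeping the three flags in sync this way avoids having one stage on the CPU while another is on the GPU, which is the situation this workaround sidesteps.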

digits122 avatar Aug 01 '22 07:08 digits122