rlpyt
Using nvprof doesn't work
I tried the following
python3 example_1.py --cuda_idx=0
This ran successfully and I could see that it was using the GPU.
When I tried the following
nvprof --print-gpu-trace python3 example_1.py --cuda_idx=0
It failed with the following error (which is a bit surprising):
Traceback (most recent call last):
File "/home/gaurav/Packages/rlpyt/rlpyt/utils/buffer.py", line 15, in buffer_from_example
buffer_type = namedarraytuple_like(example)
File "/home/gaurav/Packages/rlpyt/rlpyt/utils/collections.py", line 192, in namedarraytuple_like
raise TypeError("Input must be namedtuple or namedarraytuple instance"
TypeError: Input must be namedtuple or namedarraytuple instance or class, got <class 'numpy.ndarray'>.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example_1.py", line 68, in <module>
cuda_idx=args.cuda_idx,
File "example_1.py", line 55, in build_and_train
runner.train()
File "/home/gaurav/Packages/rlpyt/rlpyt/runners/minibatch_rl.py", line 229, in train
n_itr = self.startup()
File "/home/gaurav/Packages/rlpyt/rlpyt/runners/minibatch_rl.py", line 75, in startup
rank=rank,
File "/home/gaurav/Packages/rlpyt/rlpyt/algos/dqn/dqn.py", line 81, in initialize
self.initialize_replay_buffer(examples, batch_spec)
File "/home/gaurav/Packages/rlpyt/rlpyt/algos/dqn/dqn.py", line 134, in initialize_replay_buffer
self.replay_buffer = ReplayCls(**replay_kwargs)
File "/home/gaurav/Packages/rlpyt/rlpyt/replays/frame.py", line 38, in __init__
share_memory=self.async_) # [T+n_frames-1,B,H,W]
File "/home/gaurav/Packages/rlpyt/rlpyt/utils/buffer.py", line 17, in buffer_from_example
return build_array(example, leading_dims, share_memory)
File "/home/gaurav/Packages/rlpyt/rlpyt/utils/buffer.py", line 29, in build_array
return constructor(shape=leading_dims + a.shape, dtype=a.dtype)
MemoryError
Hmmm, strange. I haven't used nvprof, so I don't immediately know what's wrong. But maybe that MemoryError says that nvprof is somehow interfering with the way that rlpyt is allocating memory? If it's the serial code of example_1, I think it should just be allocating using np.zeros, which I can't imagine would break. Want to double check that build_array is using np.zeros or np_mp_array, and maybe there's something going on there?
Have you used nvprof with the same installation on something other than rlpyt?
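To see how large the failing request is, a minimal sketch along these lines may help. It mirrors the constructor(shape=leading_dims + a.shape, dtype=a.dtype) call from the traceback using plain np.zeros; the helper name zeros_with_size_report and the example shapes are hypothetical and not part of rlpyt.

```python
import numpy as np


def zeros_with_size_report(example, leading_dims):
    """Allocate zeros of shape leading_dims + example.shape.

    Hypothetical helper (not rlpyt code): on MemoryError, re-raise with
    the requested shape and size so the failing allocation is visible.
    """
    shape = tuple(leading_dims) + tuple(example.shape)
    n_items = 1
    for dim in shape:
        n_items *= int(dim)  # plain Python ints, so no overflow
    n_bytes = n_items * example.dtype.itemsize
    try:
        return np.zeros(shape=shape, dtype=example.dtype)
    except MemoryError as err:
        raise MemoryError(
            f"could not allocate shape {shape} of {example.dtype} "
            f"({n_bytes / 2**30:.2f} GiB)"
        ) from err


# Assumed toy sizes, chosen only so the sketch runs anywhere.
example = np.zeros((4, 4), dtype=np.uint8)
buf = zeros_with_size_report(example, leading_dims=(10, 1))
print(buf.shape, buf.dtype)
```

If the reported size is several GiB, the allocation itself is the likely culprit rather than the profiler.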
I have tried it with example_1, example_2, and example_3. I'll check what build_array is using.
I have used nvprof on the same machine (installation) with other applications and they run seamlessly.
Anyone else revisited this?
I ran into a similar problem. When I tried the following:
python main.py --cuda-idx=0
it failed with the following error:
Traceback (most recent call last):
File "/test/dreamer-pytorch-master/rlpyt/utils/buffer.py", line 33, in buffer_from_example
buffer_type = namedarraytuple_like(example)
File "/test/dreamer-pytorch-master/rlpyt/utils/collections.py", line 203, in namedarraytuple_like
raise TypeError("Input must be namedtuple or namedarraytuple instance"
TypeError: Input must be namedtuple or namedarraytuple instance or class, got <class 'numpy.ndarray'>.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 99, in
I should have posted this earlier. It's not nvprof that causes this issue.
If you run example_1.py while other memory-intensive applications, such as Google Chrome, are open, it errors out even without nvprof. It tells me that it failed to allocate about 7.73GB of memory, hence the MemoryError.
Now, the interesting thing is that even when I change the eval_max_steps and n_steps parameters in example_1.py from their 10e6 values to 10e3, I still get the error.
I haven't debugged it much, but I know nvprof isn't the cause, because example_2.py and example_3.py work without any issues.
OK interesting! Then the problem is probably allocating the replay buffer. Try DQN(replay_size=int(1e5))? The default is 1e6.
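As a rough check on why replay_size matters, the sketch below estimates the size of a frame buffer laid out as [T, B, H, W] (the layout noted in the traceback comment) without allocating it. The function name, the 104x80 uint8 frame shape, and B=1 are assumptions for illustration, not values confirmed by this thread.

```python
import numpy as np


def frame_buffer_gib(replay_size, frame_shape, dtype=np.uint8, batch_B=1):
    """Estimate the size in GiB of a [T, B, H, W] frame buffer.

    Hypothetical helper: only multiplies dimensions, never allocates.
    """
    n_items = int(replay_size) * int(batch_B)
    for dim in frame_shape:
        n_items *= int(dim)
    return n_items * np.dtype(dtype).itemsize / 2**30


# Assumed frame shape and dtype; substitute the shape your env produces.
FRAME_SHAPE = (104, 80)

for replay_size in (int(1e6), int(1e5)):
    gib = frame_buffer_gib(replay_size, FRAME_SHAPE)
    print(f"replay_size={replay_size}: about {gib:.2f} GiB")
```

Under these assumptions, the estimate scales linearly with replay_size, so dropping from 1e6 to 1e5 cuts the request by a factor of ten.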