
Using nvprof doesn't work

Open gauravjain14 opened this issue 5 years ago • 6 comments

I tried the following

python3 example_1.py --cuda_idx=0

This ran successfully, and I could see that it was using the GPUs.

When I tried the following

nvprof --print-gpu-trace python3 example_1.py --cuda_idx=0

It failed with the following error (which is a bit surprising):

Traceback (most recent call last):
  File "/home/gaurav/Packages/rlpyt/rlpyt/utils/buffer.py", line 15, in buffer_from_example
    buffer_type = namedarraytuple_like(example)
  File "/home/gaurav/Packages/rlpyt/rlpyt/utils/collections.py", line 192, in namedarraytuple_like
    raise TypeError("Input must be namedtuple or namedarraytuple instance"
TypeError: Input must be namedtuple or namedarraytuple instance or class, got <class 'numpy.ndarray'>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example_1.py", line 68, in <module>
    cuda_idx=args.cuda_idx,
  File "example_1.py", line 55, in build_and_train
    runner.train()
  File "/home/gaurav/Packages/rlpyt/rlpyt/runners/minibatch_rl.py", line 229, in train
    n_itr = self.startup()
  File "/home/gaurav/Packages/rlpyt/rlpyt/runners/minibatch_rl.py", line 75, in startup
    rank=rank,
  File "/home/gaurav/Packages/rlpyt/rlpyt/algos/dqn/dqn.py", line 81, in initialize
    self.initialize_replay_buffer(examples, batch_spec)
  File "/home/gaurav/Packages/rlpyt/rlpyt/algos/dqn/dqn.py", line 134, in initialize_replay_buffer
    self.replay_buffer = ReplayCls(**replay_kwargs)
  File "/home/gaurav/Packages/rlpyt/rlpyt/replays/frame.py", line 38, in __init__
    share_memory=self.async_)  # [T+n_frames-1,B,H,W]
  File "/home/gaurav/Packages/rlpyt/rlpyt/utils/buffer.py", line 17, in buffer_from_example
    return build_array(example, leading_dims, share_memory)
  File "/home/gaurav/Packages/rlpyt/rlpyt/utils/buffer.py", line 29, in build_array
    return constructor(shape=leading_dims + a.shape, dtype=a.dtype)
MemoryError

gauravjain14 avatar Nov 11 '19 22:11 gauravjain14

Hmmm, strange. I haven't used nvprof, so I don't immediately know what's wrong. But maybe that MemoryError means nvprof is somehow interfering with the way rlpyt allocates memory? In the serial code of example_1, I think it should just be allocating with np.zeros, which I can't imagine would break. Want to double-check whether build_array is using np.zeros or np_mp_array, and whether something is going on there?
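For reference, a minimal sketch of what buffer_from_example / build_array boil down to in the serial case (an assumption based on the traceback; the shared-memory path would swap in a multiprocessing-backed constructor like np_mp_array instead of np.zeros):

```python
import numpy as np

def build_array(example, leading_dims, share_memory=False):
    # Serial case: allocate a zeroed array shaped [*leading_dims, *example.shape]
    # with the example's dtype. rlpyt's share_memory path would use a
    # multiprocessing-backed array constructor here instead of np.zeros.
    a = np.asarray(example)
    constructor = np.zeros  # assumption: the serial example_1 path
    return constructor(shape=tuple(leading_dims) + a.shape, dtype=a.dtype)

# One uint8 frame as the example observation; allocate [T, B, H, W].
frame = np.zeros((104, 80), dtype=np.uint8)
buf = build_array(frame, leading_dims=(1000, 1))
print(buf.shape, buf.dtype)  # (1000, 1, 104, 80) uint8
```

With replay-sized leading dims (T on the order of 1e6), this single np.zeros call is where a MemoryError would surface.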

Have you used nvprof with the same installation on something other than rlpyt?

astooke avatar Nov 18 '19 17:11 astooke

I have tried it with example_1, example_2, and example_3. I'll check what build_array is using.

I have used nvprof on the same machine (installation) with other applications and they run seamlessly.

gauravjain14 avatar Nov 20 '19 17:11 gauravjain14

Anyone else revisited this?

astooke avatar Mar 02 '20 23:03 astooke

> Anyone else revisited this?

I met a similar problem. When I tried the following:

python main.py --cuda-idx=0

it failed with the following error:

Traceback (most recent call last):
  File "/test/dreamer-pytorch-master/rlpyt/utils/buffer.py", line 33, in buffer_from_example
    buffer_type = namedarraytuple_like(example)
  File "/test/dreamer-pytorch-master/rlpyt/utils/collections.py", line 203, in namedarraytuple_like
    raise TypeError("Input must be namedtuple or namedarraytuple instance"
TypeError: Input must be namedtuple or namedarraytuple instance or class, got <class 'numpy.ndarray'>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 99, in <module>
    load_model_path=args.load_model_path
  File "main.py", line 65, in build_and_train
    runner.train()
  File "/test/dreamer-pytorch-master/rlpyt/runners/minibatch_rl.py", line 252, in train
    n_itr = self.startup()
  File "/test/dreamer-pytorch-master/rlpyt/runners/minibatch_rl.py", line 95, in startup
    rank=rank,
  File "/test/dreamer-pytorch-master/dreamer/algos/dreamer_algo.py", line 82, in initialize
    self.replay_buffer = initialize_replay_buffer(self, examples, batch_spec)
  File "/test/dreamer-pytorch-master/dreamer/algos/replay.py", line 24, in initialize_replay_buffer
    replay_buffer = UniformSequenceReplayBuffer(**replay_kwargs)
  File "/test/dreamer-pytorch-master/rlpyt/replays/sequence/n_step.py", line 44, in __init__
    super().__init__(example=buffer_example, size=size, B=B, **kwargs)
  File "/test/dreamer-pytorch-master/rlpyt/replays/n_step.py", line 49, in __init__
    share_memory=self.async_)
  File "/test/dreamer-pytorch-master/rlpyt/utils/buffer.py", line 38, in buffer_from_example
    for v in example))
  File "/test/dreamer-pytorch-master/rlpyt/utils/buffer.py", line 38, in <genexpr>
    for v in example))
  File "/test/dreamer-pytorch-master/rlpyt/utils/buffer.py", line 35, in buffer_from_example
    return build_array(example, leading_dims, share_memory)
  File "/test/dreamer-pytorch-master/rlpyt/utils/buffer.py", line 52, in build_array
    return constructor(shape=leading_dims + a.shape, dtype=a.dtype)
MemoryError

zhaoweiqi626 avatar Jun 20 '20 02:06 zhaoweiqi626

I should have put this earlier. It's not nvprof that causes this issue.

If you try to run example_1.py while other memory-intensive applications, such as Google Chrome, are running, it errors out even without nvprof. It tells me that it failed to allocate about 7.73 GB of memory, hence the MemoryError.

Now, the interesting thing is that even when you change the eval_max_steps and n_steps parameters in example_1.py from their 10e6 values down to 10e3, I still get the error.

I didn't debug it much, but I know it's not because of nvprof, since example_2.py and example_3.py work without any issues.
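For what it's worth, the ~7.73 GB figure is consistent with the frame replay buffer rather than those step parameters. A back-of-envelope check, assuming rlpyt's default Atari observation shape of 104x80 uint8 and the default replay_size of 1e6 (both assumptions; check your own config):

```python
# Rough footprint of the DQN frame replay buffer. Assumptions:
# 104x80 uint8 Atari observations, replay_size = 1e6 transitions.
# Frame stacking in rlpyt stores each frame once, so n_frames only
# adds a small constant to the leading dimension.
replay_size = int(1e6)
h, w, itemsize = 104, 80, 1  # uint8 frame, 1 byte per pixel
gib = replay_size * h * w * itemsize / 2**30
print(f"{gib:.2f} GiB")  # roughly the size of the failed allocation
```

This is why shrinking eval_max_steps and n_steps doesn't help: the allocation is governed by replay_size, not by the training-step counts.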

gauravjain14 avatar Jun 20 '20 02:06 gauravjain14

OK interesting! Then the problem is probably allocating the replay buffer. Try DQN(replay_size=int(1e5))? The default is 1e6.

astooke avatar Jun 30 '20 17:06 astooke