
CUDA capabilities

Open JohannesBuchner opened this issue 5 years ago • 3 comments

Hi again,

I was wondering if you tested this code with CUDA? In the examples it seems GPUs are explicitly disabled via an environment variable. When I remove this restriction, I get torch errors related to whether an object lives on the GPU or CPU.

Some examples:

Traceback (most recent call last):
  File "examples/mcmc/rosenbrock.py", line 55, in <module>
    main(args)
  File "examples/mcmc/rosenbrock.py", line 30, in main
    bootstrap_iters=args.burnin_iters, bootstrap_mcmc_steps=5000 + 1000 * args.x_dim)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/mcmc.py", line 169, in run
    self.trainer.train(samples, max_iters=train_iters, noise=-1)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/multitrainer.py", line 171, in train
    epoch, self.netG, train_loader, noise=training_noise)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/trainer.py", line 380, in _train
    loss = -model.log_probs(data, cond_data).mean()
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 202, in log_probs
    return self.net.log_probs(inputs, cond_inputs=None)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 165, in log_probs
    u, log_jacob = self(inputs, cond_inputs)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 155, in forward
    inputs, logdet = module(inputs, cond_inputs, mode)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 119, in forward
    masked_inputs = inputs * mask
RuntimeError: expected backend CUDA and dtype Float but got backend CPU and dtype Float
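Both tracebacks fail where a tensor held by the module (a coupling-layer mask) meets an input on a different device. A common cause in PyTorch is storing such a tensor as a plain attribute, which `.cuda()`/`.to(device)` does not move, instead of registering it as a buffer. A minimal sketch (a hypothetical toy layer, not nnest's actual class) of the difference:

```python
import torch
import torch.nn as nn

class MaskedCoupling(nn.Module):
    """Toy layer illustrating why a mask can be left behind on the CPU."""
    def __init__(self, n):
        super().__init__()
        # Plain tensor attribute: NOT moved by model.cuda() / model.to(device)
        self.mask_attr = torch.ones(n)
        # Registered buffer: moved together with the module's parameters
        self.register_buffer('mask', torch.ones(n))

model = MaskedCoupling(4)
# After model.cuda(), model.mask would sit on the GPU while model.mask_attr
# stays on the CPU, so `inputs * model.mask_attr` raises exactly the
# device-mismatch RuntimeError shown above.
print('mask' in dict(model.named_buffers()))       # the buffer is tracked
print('mask_attr' in dict(model.named_buffers()))  # the attribute is invisible to .to()
```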

Traceback (most recent call last):
  File "examples/mcmc/rosenbrock.py", line 55, in <module>
    main(args)
  File "examples/mcmc/rosenbrock.py", line 30, in main
    bootstrap_iters=args.burnin_iters, bootstrap_mcmc_steps=5000 + 1000 * args.x_dim)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/mcmc.py", line 169, in run
    self.trainer.train(samples, max_iters=train_iters, noise=-1)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/multitrainer.py", line 171, in train
    epoch, self.netG, train_loader, noise=training_noise)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/trainer.py", line 380, in _train
    loss = -model.log_probs(data, cond_data).mean()
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 202, in log_probs
    return self.net.log_probs(inputs, cond_inputs=None)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 165, in log_probs
    u, log_jacob = self(inputs, cond_inputs)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 155, in forward
    inputs, logdet = module(inputs, cond_inputs, mode)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/daten/PostDoc2/home/Downloads/nnest/nnest/networks.py", line 124, in forward
    log_s = self.scale_net(masked_inputs) * (1 - mask)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/user/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1406, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #4 'mat1'

I wanted to test with CUDA to see how much faster the training is.

JohannesBuchner avatar May 05 '19 12:05 JohannesBuchner

Thanks Johannes. This is indeed a bug and I'm currently working on a fix to make it CUDA compatible. Thanks also for your other PR - will be taking a look at that very soon.

Cheers, Adam

adammoss avatar May 07 '19 13:05 adammoss

Hi @JohannesBuchner, this should now be fixed in https://github.com/adammoss/nnest/commit/a4fca887c7f69a3cc18819ba5632925a02a0ab36. I'm actually getting better performance on the CPU. My GPU isn't the latest (a 1080), but it could be that this workload isn't well suited to a GPU, as data is frequently moved back and forth between the CPU and GPU. I'd be interested to hear if you get similar results.
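For anyone wanting to compare CPU and GPU throughput themselves, a rough harness is sketched below (a generic matmul benchmark, not nnest's training loop; the helper name is made up). The `torch.cuda.synchronize()` calls matter because CUDA kernel launches are asynchronous, so timing without them measures only launch overhead:

```python
import time
import torch

def time_matmul(device, n=1024, iters=20):
    """Rough throughput check on one device (hypothetical helper)."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # drain pending kernels before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for the async kernels to finish
    return time.perf_counter() - t0

cpu_t = time_matmul(torch.device('cpu'))
if torch.cuda.is_available():
    gpu_t = time_matmul(torch.device('cuda'))
```

Note that a benchmark like this only measures on-device compute; if the real workload shuttles batches between host and device each step, transfer time can dominate, which would be consistent with the CPU coming out ahead here.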

Cheers, Adam

adammoss avatar May 08 '19 12:05 adammoss

Hi @adammoss, I tried to compare CPU and GPU performance, but I keep getting this error during training when I set `use_gpu=True`:

Traceback (most recent call last):
  File "run.py", line 81, in <module>
    main(args)
  File "run.py", line 52, in main
    jitter=args.jitter)
  File "/home/users/ahe106/git/nnest/nnest/nested.py", line 264, in run
    self.trainer.train(active_u, max_iters=train_iters, jitter=jitter)
  File "/home/users/ahe106/git/nnest/nnest/trainer.py", line 202, in train
    train_loss = self._train(epoch, train_loader, jitter=training_jitter, l2_norm=l2_norm)
  File "/home/users/ahe106/git/nnest/nnest/trainer.py", line 394, in _train
    loss = -self.netG.log_probs(data).mean()
  File "/home/users/ahe106/git/nnest/nnest/networks.py", line 72, in log_probs
    u, log_det = self.forward(inputs)
  File "/home/users/ahe106/git/nnest/nnest/networks.py", line 66, in forward
    return self.flow.forward(x)
  File "/home/users/ahe106/git/nnest/nnest/networks.py", line 30, in forward
    log_det += ld
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Is it necessary to do something extra to use the GPU?
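For reference, the standard PyTorch pattern that avoids this class of error is to move both the model and every input batch to the same device; a minimal sketch (generic PyTorch, not nnest's API):

```python
import torch

# Pick one device and use it consistently for the model and all data.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(2, 2).to(device)  # parameters now live on `device`
data = torch.randn(8, 2)                  # new tensors default to the CPU
out = model(data.to(device))              # move each batch before the forward pass
```

If the library itself creates intermediate tensors on the CPU inside the flow (as the `log_det += ld` line in the traceback suggests), no amount of user-side moving helps and it has to be fixed in the library code.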

alfa33333 avatar Aug 20 '20 02:08 alfa33333