
Error when trying to use CudaHalfTensor for training

Open mbcel opened this issue 8 years ago • 4 comments

I want to train my model in fp16 precision on my GPU. My GPU has the Pascal architecture and the cutorch.hasHalf flag is true. I am using cuDNN 5.1 and CUDA Toolkit 8.0.

As far as I understand it, I only have to change the tensors allocated on my GPU from CudaTensor to CudaHalfTensor, and the calculations should then run in fp16 precision. However, when I do that, I get an error from the optim.sgd() call that says: "No algorithms found that would fit in free GPU memory".

Am I doing something wrong? Or is fp16 actually supported for training a VGG16 model with SGD?
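For reference, the change itself is roughly the following (a simplified sketch; the real training code is further down, and the model/criterion setup is omitted):

require 'cutorch'
require 'cunn'
require 'cudnn'

-- convert the network and the criterion to half precision
model = model:type('torch.CudaHalfTensor')
criterion = criterion:type('torch.CudaHalfTensor')

-- allocate the GPU batch tensors as CudaHalfTensor instead of CudaTensor
batchInputs = torch.CudaHalfTensor()
batchLabels = torch.CudaHalfTensor()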

The detailed error message is:

In 1 module of nn.Sequential:
/home/.../torch/install/share/lua/5.1/cudnn/find.lua:469: No algorithms found that would fit in free GPU memory
stack traceback:
	[C]: in function 'error'
	/home/.../torch/install/share/lua/5.1/cudnn/find.lua:469: in function 'forwardAlgorithm'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:189: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./trainManager.lua:103: in function 'opfunc'
	/home/.../torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'

mbcel avatar Feb 17 '17 16:02 mbcel

This is presumably a bug in cuDNN 5.1 itself.

soumith avatar Feb 17 '17 17:02 soumith

@marcel1991: Can you post a snippet of your code where this is happening? What is your card, exactly? What does cutorch.hasFastHalfInstructions() return?
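E.g. a quick check from the th prompt:

require 'cutorch'
print(cutorch.hasHalf)                    -- half tensor type available at all
print(cutorch.hasFastHalfInstructions())  -- native fast fp16 arithmetic on this card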

Also: try using CUDNN6. A number of FP16 issues are fixed there. Check out the R6 branch of this repo and do 'luarocks make cudnn-scm-1.rockspec'.
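I.e. something like (assuming a fresh clone of this repo):

git clone https://github.com/soumith/cudnn.torch.git
cd cudnn.torch
git checkout R6
luarocks make cudnn-scm-1.rockspec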

borisfom avatar Feb 18 '17 00:02 borisfom

@borisfom: So cutorch.hasFastHalfInstructions() returns false. My GPU is a Titan X Pascal.

I tried CUDNN6 with the R6 branch now. It's still not working, but now I get a new error that seems to point more clearly to where something is going wrong:

/home/.../torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
/home/.../torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionForwardAlgorithm failed, sizes:  convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA8,3,368,1224 -filtA13,3,3,3 8,13,184,612 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
stack traceback:
	[C]: in function 'error'
	/home/.../torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'forwardAlgorithm'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:190: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function </home/.../torch/install/share/lua/5.1/nn/ConcatTable.lua:9>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./trainManager.lua:104: in function 'opfunc'
	...

The relevant code is:

batchInputs = torch.CudaHalfTensor()
batchLabels = torch.CudaHalfTensor()
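-- note: model, criterion, modelParameters/gradientParameters (presumably from
-- model:getParameters()) and optimisationState are defined elsewhere in the file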

-- function trains one minibatch on module
function TrainManager.trainBatch(self, batchInputsCpu, batchLabelsCpu)
  local waitTime = waitTimer:time().real
  cutorch.synchronize()
  local batchTimer = torch.Timer()

  collectgarbage() -- free unused memory
  cutorch.synchronize()

  local options = self.options

  -- copy data into gpu tensors
  batchInputs:resize(batchInputsCpu:size()):copy(batchInputsCpu)
  batchLabels:resize(batchLabelsCpu:size()):copy(batchLabelsCpu)

  local batchLoss
  -- the optimizer expects a function with input: modelParameters; output: loss, gradParams
  local opFunction = function(modelParameters)
    model:zeroGradParameters()

    local outputs = model:forward(batchInputs)
    batchLoss = criterion:forward(outputs, batchLabels)
    local gradientOutputs = criterion:backward(outputs, batchLabels)
    model:backward(batchInputs, gradientOutputs)

    -- L2 regularization (disabled):
    -- the L2 loss is intentionally not added to the reported error, so that different L2 settings can be compared fairly
    -- batchLoss = batchLoss + optimisationState.regL2 * torch.norm(modelParameters, 2)^2/2
    --gradientParameters:add( modelParameters:clone():mul(optimisationState.regL2) )

    return batchLoss, gradientParameters
  end

  optim.adam(opFunction, modelParameters, optimisationState)

...

The error occurs at the last line shown, when the adam() function is called. The same happens with the sgd() function.

mbcel avatar Feb 20 '17 21:02 mbcel

Does anyone use CudaHalfTensor successfully with a Titan X Pascal? And if yes, which NVIDIA driver and which Ubuntu version do you use?

mbcel avatar Apr 10 '17 10:04 mbcel