cifar.torch
Out of Memory Issues when Training
I'm following the steps described in this blog entry in order to run the CIFAR classification. Preprocessing with Provider.lua works fine, but training fails due to memory problems.
When I run the regular command line CUDA_VISIBLE_DEVICES=0 th train.lua, I get the following output:
{
learningRate : 1
momentum : 0.9
epoch_step : 25
learningRateDecay : 1e-07
batchSize : 128
model : "vgg_bn_drop"
save : "logs"
weightDecay : 0.0005
backend : "nn"
max_epoch : 300
}
==> configuring model
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9668/cutorch/lib/THC/generic/THCStorage.cu line=41 error=2 : out of memory
/Users/artcfa/torch/install/bin/luajit: /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:11: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9668/cutorch/lib/THC/generic/THCStorage.cu:41
stack traceback:
[C]: in function 'resize'
/Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
/Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
/Users/artcfa/torch/install/share/lua/5.1/nn/Module.lua:126: in function 'type'
/Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
/Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
/Users/artcfa/torch/install/share/lua/5.1/nn/Module.lua:126: in function 'cuda'
train.lua:47: in main chunk
[C]: in function 'dofile'
...edja/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x0107712bd0
So apparently CUDA is reporting that it is out of memory. I compiled the NVIDIA CUDA samples and ran the deviceQuery sample to get some stats on the device:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 650M"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1024 MBytes (1073414144 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 900 MHz (0.90 GHz)
Memory Clock rate: 2508 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 650M
Result = PASS
So I thought maybe 1 GB of total available CUDA memory simply wasn't enough. I modified the training code so it runs on the CPU without using CUDA (which you can find here), and now it starts up and begins training. However, after around 9 of the 390 training batches it crashes with the message luajit: not enough memory, so that doesn't work either.
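For what it's worth, a rough estimate suggests 1 GB really is tight for this model at batchSize 128: the output of just the first 64-map convolution is about 128 * 64 * 32 * 32 * 4 bytes ≈ 32 MB of float activations, and the 54-layer stack keeps activations and gradients for every layer, so memory can plausibly run out before the first batch completes. The CPU modification itself amounts to building the network with plain nn and converting everything to float instead of CUDA; a minimal sketch of the idea (a toy model stands in for vgg_bn_drop here, this is not my exact code):
-- CPU-only sketch: float tensors throughout, no cutorch/cunn required
require 'nn'
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))  -- placeholder for the real layer stack
model:add(nn.ReLU())
model:add(nn.View(64 * 32 * 32))
model:add(nn.Linear(64 * 32 * 32, 10))
model:float()                                               -- instead of model:cuda()
local criterion = nn.CrossEntropyCriterion():float()
local inputs  = torch.FloatTensor(8, 3, 32, 32):uniform()   -- instead of torch.CudaTensor
local targets = torch.Tensor(8):random(1, 10)
print(criterion:forward(model:forward(inputs), targets))    -- runs entirely on the CPU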
Am I doing something wrong? What can I do to run this?
try his branch: https://github.com/szagoruyko/cifar.torch/tree/cpu
Tried it, same result as with my custom CPU code linked above. I started it with the following command line:
CUDA_VISIBLE_DEVICES=0 th train.lua --type float
which resulted in the following output:
{
learningRate : 1
type : "float"
momentum : 0.9
epoch_step : 25
learningRateDecay : 1e-07
batchSize : 128
model : "vgg_bn_drop"
save : "logs"
backend : "nn"
weightDecay : 0.0005
max_epoch : 300
}
==> configuring model
nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.BatchFlip
(2): nn.Copy
(3): nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> (43) -> (44) -> (45) -> (46) -> (47) -> (48) -> (49) -> (50) -> (51) -> (52) -> (53) -> (54) -> output]
(1): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.Dropout(0.300000)
(5): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
(6): nn.SpatialBatchNormalization
(7): nn.ReLU
(8): nn.SpatialMaxPooling(2x2, 2,2)
(9): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
(10): nn.SpatialBatchNormalization
(11): nn.ReLU
(12): nn.Dropout(0.400000)
(13): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
(14): nn.SpatialBatchNormalization
(15): nn.ReLU
(16): nn.SpatialMaxPooling(2x2, 2,2)
(17): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
(18): nn.SpatialBatchNormalization
(19): nn.ReLU
(20): nn.Dropout(0.400000)
(21): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(22): nn.SpatialBatchNormalization
(23): nn.ReLU
(24): nn.Dropout(0.400000)
(25): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(26): nn.SpatialBatchNormalization
(27): nn.ReLU
(28): nn.SpatialMaxPooling(2x2, 2,2)
(29): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
(30): nn.SpatialBatchNormalization
(31): nn.ReLU
(32): nn.Dropout(0.400000)
(33): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(34): nn.SpatialBatchNormalization
(35): nn.ReLU
(36): nn.Dropout(0.400000)
(37): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(38): nn.SpatialBatchNormalization
(39): nn.ReLU
(40): nn.SpatialMaxPooling(2x2, 2,2)
(41): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(42): nn.SpatialBatchNormalization
(43): nn.ReLU
(44): nn.Dropout(0.400000)
(45): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(46): nn.SpatialBatchNormalization
(47): nn.ReLU
(48): nn.Dropout(0.400000)
(49): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(50): nn.SpatialBatchNormalization
(51): nn.ReLU
(52): nn.SpatialMaxPooling(2x2, 2,2)
(53): nn.View(512)
(54): nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
(1): nn.Dropout(0.500000)
(2): nn.Linear(512 -> 512)
(3): nn.BatchNormalization
(4): nn.ReLU
(5): nn.Dropout(0.500000)
(6): nn.Linear(512 -> 10)
}
}
}
==> loading data
Will save at logs
==> setting criterion
==> configuring optimizer
==> online epoch # 1 [batchSize = 128]
/Users/artcfa/torch/install/bin/luajit: not enough memory..................] ETA: 30m18s | Step: 5s52ms
The error happens at around batch 10 of 390, long before the first epoch has finished training.
Hi, I had the same problem on a MacBook Pro when I tried
th train.lua --type=float
The problem is related to LuaJIT, which cannot allocate that much memory (it has a hard limit of roughly 1-2 GB on 64-bit systems). So I reinstalled Torch with
TORCH_LUA_VERSION=LUA51 ./install.sh
which solved the problem. This makes Torch use plain Lua 5.1 instead of LuaJIT.
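If you are not sure which interpreter your th launcher is using, you can check it directly (this assumes the -e flag of the standard th/trepl launcher):
th -e "print(jit and jit.version or _VERSION)"
Under LuaJIT this prints the LuaJIT version string; after reinstalling with TORCH_LUA_VERSION=LUA51 it prints Lua 5.1.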
Check nvidia-smi. Sometimes another process occupies too much GPU memory, and you can try killing that process. I have seen /usr/lib/xorg/Xorg occupy around 1800 MB of GPU memory.
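For example, something along these lines (the PID is whatever nvidia-smi lists for the offending process):
nvidia-smi        # the Processes table at the bottom shows per-process GPU memory usage
sudo kill <PID>   # e.g. a stale training run that is still holding GPU memory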