
Out of Memory after 1 epoch using DenseNet-BC 100 layers (growth_rate=12)

Open ZhenyF opened this issue 7 years ago • 5 comments

Hi, I tried the DenseNet you recommended with growth_rate=12, depth=100, and batch_size=128 on two GTX 1080 Ti cards. The model runs out of memory and stops after one epoch. Could you please help me with this?

ZhenyF avatar Jun 07 '18 04:06 ZhenyF

Hi @ZhenyF, the suggested batch_size is 64 (the same as in the DenseNet paper). It should use about 2.7 GB. Here is the suggested command:

python3 main.py --arch densenet --depth 100 --growth-rate 12 --bn-size 4 --compression 0.5 --data cifar10+ --epochs 300 --save save/cifar10+-densenet-bc-100

I also tried batch_size 128, which used about 5.0 GB. I believe it should fit on a GTX 1080 Ti.
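One way to verify numbers like 2.7 GB or 5.0 GB on your own machine is to ask PyTorch's CUDA allocator for its peak usage after an epoch. A minimal sketch; the `report_peak_memory_gb` helper below is illustrative, not part of this repo:

```python
import torch

def report_peak_memory_gb(device=0):
    """Peak GB held in CUDA tensors since the process started
    (or since torch.cuda.reset_peak_memory_stats was last called)."""
    if not torch.cuda.is_available():
        return None  # running on CPU: nothing to report
    return torch.cuda.max_memory_allocated(device) / 1024 ** 3
```

Calling this after the first epoch (once per GPU) shows whether batch_size=128 really stays near 5.0 GB on each card, or whether something else is ballooning.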

If it still doesn't work, you may try this memory-efficient implementation by my friend Geoff.
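For context, the savings in that implementation come from gradient checkpointing: the concatenation/BN/ReLU activations of each dense layer are recomputed during the backward pass instead of being stored. A rough sketch of the idea with `torch.utils.checkpoint` — the `DenseLayer` and `run_layer` names here are simplified stand-ins, not the repo's actual modules:

```python
import torch
from torch.utils.checkpoint import checkpoint

class DenseLayer(torch.nn.Module):
    """Simplified DenseNet layer (BN-ReLU-Conv) to illustrate checkpointing."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = torch.nn.BatchNorm2d(in_channels)
        self.relu = torch.nn.ReLU(inplace=True)
        self.conv = torch.nn.Conv2d(in_channels, growth_rate,
                                    kernel_size=3, padding=1, bias=False)

    def forward(self, *prev_features):
        # DenseNet concatenates all preceding feature maps.
        x = torch.cat(prev_features, dim=1)
        return self.conv(self.relu(self.norm(x)))

def run_layer(layer, features, efficient=True):
    if efficient and any(f.requires_grad for f in features):
        # Don't store the intermediate activations; recompute them
        # from `features` when backward reaches this layer.
        return checkpoint(layer, *features)
    return layer(*features)
```

This trades extra forward computation for memory, which is why the efficient mode is noticeably slower per epoch.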

felixgwu avatar Jun 07 '18 06:06 felixgwu

Many thanks for the reply, @felixgwu! Just out of curiosity: if it stopped because the batch size was too large, why could it still train for a full epoch first? I checked my two GPUs and found only 67% of their memory was occupied during the first epoch. (I also tried the largest DenseNet-BC, growth-rate=40 and depth=190 with batch-size=64, and that one stopped right at the very beginning.)

Another question: I tried the memory-efficient implementation you recommended. When I set `efficient` to True (memory-efficient mode), it prints the output below and never starts training, but with `efficient` set to False it runs as usual:

```
(pytorch) D:\GA\PYTorch\img_classification_pk_pytorch-master>python main.py --data cifar10+ --depth 100 --save save/cifar10+-densenetBC12_100 --arch densenet_eff
WARNING: you don't have tesnorboard_logger installed
=> creating model 'densenet_eff'
Create DenseNet-BC100 for cifar10+
loading cifar10+ {'augmentation': True, 'num_classes': 10} with data augmentation
Files already downloaded and verified
create folder: save/cifar10+-densenetBC12_100
args: Namespace(alpha=0.99, arch='densenet_eff', batch_size=128, beta1=0.9, beta2=0.999, bn_size=4, compression=0.5, config_of_data={'augmentation': True, 'num_classes': 10}, data='cifar10+', data_root='Z:\Datasets\CIFAR_10_dataset', death_mode='none', death_rate=0.5, decay_rate=0.1, depth=100, drop_rate=0.0, epochs=300, evaluate='', force=False, growth_rate=12, lr=0.1, momentum=0.9, nesterov=False, normalized=False, num_classes=10, num_workers=4, optimizer='sgd', patience=0, print_freq=100, resume='', save='save/cifar10+-densenetBC12_100', seed=0, start_epoch=1, tensorboard=False, trainer='train', use_validset=True, weight_decay=0.0001)
# of params: 769162
Epoch 1
lr = 1.000000e-01
D:\GA\PYTorch\img_classification_pk_pytorch-master\train.py:47: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  losses.update(loss.data[0], input.size(0))
D:\PYTorch\img_classification_pk_pytorch-master\train.py:48: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  top1.update(err1[0], input.size(0))
D:\PYTorch\img_classification_pk_pytorch-master\train.py:49: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  top5.update(err5[0], input.size(0))
D:\Anaconda3\envs\pytorch\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
  warnings.warn('PyTorch is not compiled with NCCL support')
```
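As an aside, the three `UserWarning`s in that log all point at the same fix: since PyTorch 0.4, a scalar loss is a 0-dim tensor, so indexing `loss.data[0]` is deprecated and `loss.item()` should be used instead. A minimal sketch; the `AverageMeter` below is a stand-in for the meters used in `train.py`, not the repo's exact class:

```python
import torch

class AverageMeter:
    """Running-average tracker, as commonly used in PyTorch training loops."""
    def __init__(self):
        self.sum, self.count = 0.0, 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count

losses = AverageMeter()
loss = torch.tensor(0.25)      # 0-dim scalar tensor, like loss.data
losses.update(loss.item(), 8)  # .item() replaces the old loss.data[0]
print(losses.avg)              # 0.25
```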

ZhenyF avatar Jun 07 '18 09:06 ZhenyF

Hi @ZhenyF,

It seems that you're using the Windows version of PyTorch. Could it be a bug in the Windows build?

taineleau-zz avatar Jun 13 '18 18:06 taineleau-zz

Hi @taineleau, I'm not sure whether it's caused by the difference in OS. Another problem is that I cannot reach a similar accuracy with DenseNet-40: I only get a 6.0% error rate (5.7% at best), versus 5.44% with TensorFlow. Could this be caused by PyTorch itself, or by an error in my implementation?

ZhenyF avatar Jun 15 '18 00:06 ZhenyF

Hi @ZhenyF, did you notice that we hold out a portion of the training data as a validation set?
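That holdout matters for the comparison: with `use_validset=True` the trainer never sees the held-out CIFAR-10 images, so it effectively trains on fewer images than a setup that uses all 50,000, which can account for part of an error-rate gap. A hedged sketch of such a split — the 90/10 ratio and toy dataset below are illustrative, not the repo's exact code:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for CIFAR-10: 1,000 random "images" instead of 50,000.
full_train = TensorDataset(torch.randn(1000, 3, 32, 32),
                           torch.randint(0, 10, (1000,)))

# Hold out 10% for validation; the model then trains on only 90% of the data.
train_set, val_set = random_split(full_train, [900, 100],
                                  generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set))  # 900 100
```

To match a paper's numbers exactly, one would instead train on the full training set and report test error, accepting that no held-out set remains for model selection.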

taineleau-zz avatar Jun 23 '18 13:06 taineleau-zz