
VRAM usage not constant in v2.0.0rc1

Open · JOHW85 opened this issue on Oct 01, 2020 · 3 comments

Using the same (Transformer Big) model parameters in 1.2.0 and 2.0.0rc1, my 3090 (24 GB) runs out of memory at random points during training (VRAM usage swings from ~75% to more than 100% of the 24 GB), even when I try to reduce the batch size.

In 1.2.0, training uses a constant amount of VRAM (12 GB of 24 GB).

Speed-wise, 1.2.0 also seems faster during training (~10500/14000 tok/s vs ~9000/11000 tok/s).

See the snippets below for the parameters.

2.0.0rc1:

world_size: 1
gpu_ranks: [0]
queue_size: 10000
bucket_size: 32768
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 1
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 16
rnn_size: 1024
word_vec_size: 1024
transformer_ff: 4096
dropout_steps: [0]
dropout: [0.3]
attention_dropout: [0.1]

vs 1.2.0:

        --encoder_type transformer --decoder_type transformer --position_encoding \
        --train_steps 300000  --max_generator_batches 2 --dropout 0.1 \
        --batch_size 4096 --batch_type tokens --normalization tokens  --accum_count 2 \
        --optim adam --adam_beta2 0.998 --decay_method noam --warmup_steps 8000 --learning_rate 2 \
        --max_grad_norm 0 --param_init 0  --param_init_glorot \
        --label_smoothing 0.1 --valid_steps 10000 --save_checkpoint_steps 10000 \
        --world_size 1 --gpu_ranks 0

I'm not sure if VRAM usage was more constrained in 1.2.0 because of sharding (which v2 doesn't seem to use).

JOHW85 avatar Oct 01 '20 05:10 JOHW85

Hi there,

VRAM usage should actually be a bit lower with 2.0. You might want to try the filtertoolong transform to filter out long examples. (This was done at the preprocessing step before.)
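For reference, enabling that transform in a 2.0 config could look roughly like this (a minimal sketch; the data section, paths, and length limits are illustrative placeholders, not values from this issue):

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [filtertoolong]
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
# filtertoolong options: drop examples longer than these token counts
src_seq_length: 200
tgt_seq_length: 200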

As for performance, long examples may not help either, but that's probably not the whole story. What CPU are you using with your 3090?

francoishernandez avatar Oct 01 '20 06:10 francoishernandez

@JOHW85 any update on this? I would be keen to have more details about potential issues with upcoming hardware.

francoishernandez avatar Nov 09 '20 16:11 francoishernandez

It seems the variable VRAM usage was due to not including filtertoolong.

I'll check later to see if the speed difference between 1.2 and 2.0 is still present.

JOHW85 avatar Jan 19 '21 09:01 JOHW85