
OOM (out of memory) during training step.

Open luocanrao opened this issue 7 years ago • 11 comments

My machine has about 3 GB of memory. How can I fix this problem?

The dmesg output is as follows: Out of memory: Kill process 10192 (python) score 398 or sacrifice child Killed process 10192, UID 1801, (python) total-vm:7804984kB, anon-rss:2460052kB, file-rss:284kB

luocanrao avatar Feb 06 '18 08:02 luocanrao

Preprocessing might be a solution. Save the inputs to disk as numpy arrays. Or adjust the hyperparams.
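A minimal sketch of that "preprocess once, load from disk" idea, assuming librosa is available for feature extraction; the function names, directory paths, and spectrogram parameters below are placeholders, not the repo's actual preprocessing code:

```python
# Sketch: precompute log-mel spectrograms and save them as .npy files,
# so training loads features from disk instead of recomputing/caching in RAM.
import os
import glob
import numpy as np
import librosa

def wav_to_mel(path, sr=22050, n_fft=2048, hop_length=256, n_mels=80):
    """Load a wav file and return a (T, n_mels) log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel.T + 1e-5).astype(np.float32)

def preprocess(wav_dir, out_dir):
    """Convert every wav in wav_dir to a .npy feature file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for wav_path in glob.glob(os.path.join(wav_dir, "*.wav")):
        name = os.path.splitext(os.path.basename(wav_path))[0]
        np.save(os.path.join(out_dir, name + ".npy"), wav_to_mel(wav_path))

if __name__ == "__main__":
    preprocess("LJSpeech-1.1/wavs", "mels")  # example paths only
```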

Kyubyong avatar Feb 06 '18 09:02 Kyubyong

I have adjusted batch_size = 8, but it still OOMs. The training status output is as below: 21%|######2 | 351/1637 [42:49<2:36:55, 7.32s/b] It seems strange to me that memory keeps increasing even though the training is only about 21% of the way through.
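One way to confirm whether resident memory really grows with the step count is to log the process RSS every so often inside the training loop. This uses psutil, which is not part of the repo; it is only a diagnostic sketch:

```python
# Diagnostic sketch: log the training process's resident memory per step
# to verify whether RAM usage actually grows as training proceeds.
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(step):
    rss_mb = _proc.memory_info().rss / (1024 ** 2)
    print("step %d: rss = %.1f MB" % (step, rss_mb))

# Inside the training loop (pseudocode placement):
# for step in range(num_steps):
#     ...run one training step...
#     if step % 100 == 0:
#         log_rss(step)
```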

luocanrao avatar Feb 06 '18 09:02 luocanrao

I switched to another machine with about 8 GB of memory, but it still OOMs.

luocanrao avatar Feb 07 '18 02:02 luocanrao

@luocanrao Hello, I also ran into this problem. I even changed batch_size to 4. Did you solve it?

sunnnnnnnny avatar Mar 09 '18 01:03 sunnnnnnnny

When I changed batch_size to 8, the OOM problem went away.

qizhao1 avatar Mar 13 '18 03:03 qizhao1

@qizhao1 How many epochs did you train and how long did it take?

sunnnnnnnny avatar Mar 13 '18 08:03 sunnnnnnnny

@18810319795 When I changed batch_size to 8, training still hit OOM around step 20K, so I changed batch_size to 4 (with n_iter still at 50). That solved the OOM, but the alignment has some problems.
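For reference, the kind of hyperparameter edit being described looks roughly like this, assuming batch size and the iteration count live in a hyperparams-style module; the attribute names are illustrative rather than the repo's exact ones:

```python
# Illustrative hyperparameter tweak (placeholder names, not necessarily
# those used by the repo's hyperparams module).
class Hyperparams:
    batch_size = 4   # reduced from 8 after still hitting OOM around step 20K
    n_iter = 50      # left at 50, as in the comment above
```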

qizhao1 avatar Mar 14 '18 00:03 qizhao1

@qizhao1 How long did it take you to train to 20K? Are you training on the full dataset? And could you tell me your GPU memory size? I would like to compare.

sunnnnnnnny avatar Mar 14 '18 01:03 sunnnnnnnny

@18810319795 Almost one hour to 20K, training on the full dataset. Total memory is 7.92 GB, free memory is 7.42 GB.

qizhao1 avatar Mar 14 '18 01:03 qizhao1

@qizhao1 OK, I see. Thank you.

sunnnnnnnny avatar Mar 14 '18 01:03 sunnnnnnnny

@qizhao1 Amazing. I also use the full dataset with batch_size=4 and n_iter=50; total memory is 12 GB, free memory is 4 GB, but it takes me almost one hour to reach 2K steps. I am confused.

sunnnnnnnny avatar Mar 17 '18 13:03 sunnnnnnnny