
Why does memory keep increasing during training?

Open hugh920 opened this issue 3 years ago • 10 comments

Dear author, thanks for your code. When I tried to reproduce it, I found that memory usage kept increasing until training finally failed with an out-of-memory error. What could be the reason?

hugh920 avatar Apr 05 '22 01:04 hugh920

Likewise, trying to find a fix now.

My process keeps getting killed, even when running with 25GB RAM on Google Colab.

dgbarclay avatar Apr 05 '22 09:04 dgbarclay

@hugh920 could you let me know if you find a fix in the meantime 👍

dgbarclay avatar Apr 05 '22 15:04 dgbarclay

@dgbarclay When you train, does your memory usage increase with each epoch? How many epochs have you reached so far?

hugh920 avatar Apr 06 '22 03:04 hugh920

@hugh920 Mine is being killed whilst parsing the data, it doesn't reach the beginning of training. It seems to fall within the block on line 203 of util.py. Are you able to begin training? Have you modified the code?

dgbarclay avatar Apr 06 '22 09:04 dgbarclay

@dgbarclay Mine can be trained without modifying the code. However, because memory usage kept growing, it failed in the second round. I reduced batch_size and made the model structure a little simpler so that it could continue to run. I noticed that memory usage increased during the first two training rounds and stabilized after the third. I don't understand why.

hugh920 avatar Apr 07 '22 07:04 hugh920
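
To pin down whether memory really grows from one round to the next, it can help to log it after every epoch. The following is a minimal sketch using only Python's standard `tracemalloc` module. The names `run_with_memory_log`, `train_one_epoch`, and `leaky_epoch` are hypothetical and do not come from this repository. Note that `tracemalloc` only sees allocations made through Python's allocator, so tensor memory allocated by native libraries would need a process-level tool instead.

```python
import tracemalloc


def run_with_memory_log(train_one_epoch, num_epochs):
    """Call train_one_epoch(epoch) and record traced Python memory after each epoch.

    train_one_epoch is a hypothetical callable standing in for the real
    training loop; it is not a function from this repository.
    """
    tracemalloc.start()
    history = []
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            # current: bytes still allocated now; peak: highest value since start()
            current, peak = tracemalloc.get_traced_memory()
            history.append((epoch, current, peak))
            print(f"epoch {epoch}: current={current / 1e6:.1f} MB, "
                  f"peak={peak / 1e6:.1f} MB")
    finally:
        tracemalloc.stop()
    return history


# Toy stand-in for a leaking loop: every epoch keeps one more buffer alive.
_kept_alive = []


def leaky_epoch(epoch):
    _kept_alive.append(bytearray(1_000_000))
```

Calling `run_with_memory_log(leaky_epoch, 3)` with the toy stand-in shows the `current` value rising every epoch. If the real training loop shows the same pattern, something is holding references across epochs; if `current` stays flat while the process still grows, the growth is happening outside Python's allocator.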

@hugh920 Okay, I have not made it that far yet. I was running out of memory while building the DataLoader, so I'm having to refactor a little bit. Are you able to push your version so I can compare the two? It would help me out loads, cheers.

dgbarclay avatar Apr 07 '22 14:04 dgbarclay

@hugh920 Are you able to run eval_nus_wide.sh without failure? I ultimately just need to be able to run this model to take image queries and give predictions. Are you able to get the model into that state?

dgbarclay avatar Apr 07 '22 18:04 dgbarclay

@dgbarclay I took the ALF out and just used FLF, which didn't work well. It may not be what you need.

hugh920 avatar Apr 10 '22 07:04 hugh920

@dgbarclay I also had a problem with processes being killed while loading data on other projects today. I observed that the GPU is not utilized while the data is loading. The DataLoader runs on the CPU, so the cause is probably that the CPU's processing power is not up to it. It has nothing to do with memory size or the GPU.

hugh920 avatar Apr 12 '22 12:04 hugh920

Is the issue solved?

akshitac8 avatar Jun 21 '22 09:06 akshitac8