
Out of Memory after training a few epochs

Open waylonli opened this issue 2 years ago • 0 comments

The code I'm using is in the file "one_file_ref". I was trying to apply the Mistral Transformer to non-text tabular data. I initialised "positions" as torch.arange(1, num_of_most_instances), where "num_of_most_instances" is the number of tokens in the longest sequence. However, I observed that each time I called loss.backward() and moved on to the next batch, about 30 MB of GPU memory could not be released, so after 1000 steps roughly 30 GB of GPU memory had accumulated.
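This growth pattern is typical of a cache tensor stored on the module without being detached from the autograd graph: each step's graph stays reachable through the cache, so backward() cannot free it. A minimal sketch of the failure mode (all names here are hypothetical, not Mistral's actual code):

```python
import torch
import torch.nn as nn

class CachingLayer(nn.Module):
    """Toy layer that stores its output in a cache attribute,
    mimicking the self.cache pattern described in the issue."""
    def __init__(self, detach_cache: bool):
        super().__init__()
        self.linear = nn.Linear(8, 8)
        self.detach_cache = detach_cache
        self.cache = None

    def forward(self, x):
        out = self.linear(x)
        # Storing `out` without .detach() keeps a reference to this
        # step's autograd graph alive across iterations.
        self.cache = out.detach() if self.detach_cache else out
        return out

leaky = CachingLayer(detach_cache=False)
safe = CachingLayer(detach_cache=True)
x = torch.randn(4, 8)

leaky(x).sum().backward()
safe(x).sum().backward()

# The undetached cache still carries a grad_fn (graph reference);
# the detached one does not.
print(leaky.cache.grad_fn is not None)  # True
print(safe.cache.grad_fn is not None)   # False
```

If the real cause is the same, detaching whatever is written into self.cache (or clearing the cache between training batches) should stop the per-step memory growth without disabling caching entirely.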

I also found that execution always entered line 131 and never reached the "else" branch with my initialised "positions". Is there a mistake in my usage of "positions"? The issue no longer occurs after I comment out all the code related to self.cache, but I'm wondering whether that will affect the attention mechanism.
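One plausible explanation (a sketch, assuming the code branches on how many positions are passed per call, as inference code with a KV cache commonly does): torch.arange(1, num_of_most_instances) produces num_of_most_instances - 1 positions, so every call looks like a multi-token "prefill" call and a single-token decoding branch would never be taken. Note also that token positions are usually 0-based:

```python
import torch

num_of_most_instances = 16  # hypothetical value for illustration

# As described in the issue: starts at 1, length 15
positions = torch.arange(1, num_of_most_instances)
print(positions.shape[0])  # 15 -> always the multi-position branch

# 0-based positions covering the whole sequence would be:
positions_full = torch.arange(0, num_of_most_instances)  # length 16

# Incremental decoding typically passes one position per step:
positions_step = torch.tensor([5])  # e.g. the token at index 5
print(positions_step.shape[0])  # 1 -> the single-position branch
```

If the branch in question distinguishes prefill from step-by-step decoding, always passing the full range is expected behaviour during training, not a bug in itself, but it does mean the cache-filling path runs on every batch.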

waylonli avatar Sep 30 '23 00:09 waylonli