
Model size and GPU memory

Open ChangFengPoLang opened this issue 8 months ago • 6 comments

Thanks for such an excellent project. I have a question: why does training the small 25M-parameter model already fill up two 4090D GPUs? When I increase the number of layers to 12, I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 800.00 MiB. GPU 0 has a total capacity of 23.63 GiB of which 540.62 MiB is free. Including non-PyTorch memory, this process has 22.40 GiB memory in use. Of the allocated memory 21.84 GiB is allocated by PyTorch, and 37.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I take this to mean there is not enough GPU memory, never mind the 104M model. But when I asked DeepSeek, it said a single 4090D can train a 1.3B-parameter model without any problem (using 12 GB of GPU memory). So I don't understand why 2× 4090D can only train the small 25M model.
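For reference, a rough back-of-the-envelope sketch of where the memory might go. It assumes fp32 training with an Adam-style optimizer (two extra states per parameter); the batch size, sequence length, and vocabulary size below are hypothetical placeholders, not values taken from this issue. It only illustrates that weights, gradients, and optimizer states for a 25M model are small, so most of a ~22 GiB footprint would have to come from activations, which grow with batch size, sequence length, and layer count.

```python
MIB = 1024 ** 2


def static_memory_bytes(n_params, bytes_per_param=4, optimizer_states=2):
    """Weights + gradients + optimizer states, all at bytes_per_param each.

    Assumption: fp32 (4 bytes) and an Adam-style optimizer with 2 states
    per parameter. Mixed precision or other optimizers change this.
    """
    return n_params * bytes_per_param * (1 + 1 + optimizer_states)


def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Size of one (batch, seq, vocab) logits tensor, a single activation."""
    return batch_size * seq_len * vocab_size * bytes_per_value


# 25M parameters: static cost is independent of batch size.
static_25m = static_memory_bytes(25_000_000)
print(f"static (25M params): {static_25m / MIB:.0f} MiB")

# Hypothetical batch/sequence/vocab values, for illustration only.
one_logits = logits_bytes(batch_size=32, seq_len=512, vocab_size=6400)
print(f"one logits tensor:   {one_logits / MIB:.0f} MiB")
```

If the numbers look like this, reducing the batch size or maximum sequence length should free far more memory than shrinking the model would.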

ChangFengPoLang, Feb 13 '25 14:02