BELLE
Out of GPU memory:

CUDA out of memory. Tried to allocate 6.28 GiB (GPU 1; 39.45 GiB total capacity; 31.41 GiB already allocated; 5.99 GiB free; 31.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am running on 4 A100 40G cards and still get an out-of-memory error. What hardware setup is required? Thanks.
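The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of how that is done, assuming the variable is set before the first CUDA allocation; note this only mitigates allocator fragmentation and does not shrink the model's actual memory footprint, and 128 is an illustrative value, not a tuned one:

import os

# Must be set before the first CUDA allocation so the caching allocator
# picks it up; 128 MB is an illustrative value, not a tuned one.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the allocator config so the setting takes effect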
If you are doing non-LoRA (full-parameter) training, 40 GB is not enough. For non-LoRA training with a max length of 1024, a 7B-or-larger model only fits on 80 GB A100s. Alternatively, configure CPU offload in DeepSpeed, but training then becomes very slow.
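LoRA training is the lower-memory alternative implied above: only small adapter matrices get gradients and optimizer state, so the frozen base model dominates memory. A minimal sketch using the PEFT library, where the checkpoint path and the hyperparameters (r, lora_alpha, target_modules) are illustrative assumptions, not BELLE's actual settings:

import torch
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

# "path/to/llama-7b" is a hypothetical local path; substitute your own weights.
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b", torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapters need optimizer state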
Hello, thanks for the explanation. I have 4 A100s, 40 GB each. Do you mean four 80 GB cards are required? I have already reduced the max length to 128 and set per_device_train_batch_size=1.
Hello, how many A100 80G cards are you using? I have 4 A100 40G cards here, and I have reduced cutoff_len from 1024 to 128.
We use 8 cards. We don't recommend reducing the length to 128, because the input_ids produced by LlamaTokenizer are much longer than the raw sentence length. We suggest trying CPU offload, though it will be slower.
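To see how much longer the tokenized sequence is than the raw text, a quick check along these lines can help; the checkpoint path is a hypothetical placeholder:

from transformers import LlamaTokenizer

# Hypothetical local path; substitute your own LLaMA weights.
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")

text = "今天天气真好,我们一起去公园散步吧。"
input_ids = tokenizer(text)["input_ids"]

# LLaMA's SentencePiece vocabulary covers few Chinese characters directly,
# so byte fallback often emits several tokens per character, making
# len(input_ids) far larger than len(text).
print(len(text), len(input_ids))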
Thank you for the reply. Following your suggestion, I set the deepspeed.json file to:

"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
},

Then it reports:

No modifications detected for re-loaded extension module utils, skipping build step..
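For reference, a sketch of what the complete config might look like as a Python dict, usable directly via TrainingArguments(deepspeed=...) with the HuggingFace Trainer or dumped to JSON for the deepspeed launcher; the keys outside "zero_optimization" are illustrative assumptions, with "auto" values resolved by the Trainer integration:

import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True,
    },
}

# Write out as JSON for `deepspeed train.py --deepspeed deepspeed.json`.
with open("deepspeed.json", "w") as f:
    json.dump(ds_config, f, indent=2)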

This may be a failure when compiling the dynamic library (the DeepSpeed JIT extension), not a problem with deepspeed.json.
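One common remedy for a broken JIT extension build, offered here as an assumption rather than a confirmed fix for this case, is to delete the cached build directory so DeepSpeed recompiles the extension from scratch; the path below is PyTorch's default extension cache (overridable via TORCH_EXTENSIONS_DIR):

import shutil
from pathlib import Path

# Default cache where torch/DeepSpeed JIT extensions are built; removing it
# forces a clean rebuild on the next run.
cache = Path.home() / ".cache" / "torch_extensions"
if cache.exists():
    shutil.rmtree(cache)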