BELLE
Out of GPU memory:

CUDA out of memory. Tried to allocate 6.28 GiB (GPU 1; 39.45 GiB total capacity; 31.41 GiB already allocated; 5.99 GiB free; 31.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am running on 4 A100 40G cards and still get an out-of-memory error. What hardware setup is required? Thanks.
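The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of how that is done, assuming the variable is set before the first CUDA allocation; note this only mitigates allocator fragmentation and does not shrink the model's actual memory footprint, and 128 is an illustrative value, not a tuned one:

import os

# Must be set before the first CUDA allocation so the caching allocator
# picks it up; 128 MB is an illustrative value, not a tuned one.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the allocator config so the setting takes effect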
If you are doing non-LoRA (full-parameter) training, 40 GB is not enough. For non-LoRA training with a max length of 1024, a 7B-or-larger model only fits on 80 GB A100s. Alternatively, configure CPU offload in DeepSpeed, but training then becomes very slow.
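LoRA training is the lower-memory alternative implied above: only small adapter matrices get gradients and optimizer state, so the frozen base model dominates memory. A minimal sketch using the PEFT library, where the checkpoint path and the hyperparameters (r, lora_alpha, target_modules) are illustrative assumptions, not BELLE's actual settings:

import torch
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

# "path/to/llama-7b" is a hypothetical local path; substitute your own weights.
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b", torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapters need optimizer state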
Hello, thanks for the explanation. I have 4 A100s, 40 GB each. Do you mean four 80 GB cards are required? I have already reduced the max length to 128 and set per_device_train_batch_size=1.
Hello, how many A100 80G cards are you using? I have 4 A100 40G cards here, and I have reduced cutoff_len from 1024 to 128.
We use 8 cards. We don't recommend reducing the length to 128, because the input_ids produced by LlamaTokenizer are much longer than the raw sentence length. We suggest trying CPU offload, though it will be slower.
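To see how much longer the tokenized sequence is than the raw text, a quick check along these lines can help; the checkpoint path is a hypothetical placeholder:

from transformers import LlamaTokenizer

# Hypothetical local path; substitute your own LLaMA weights.
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")

text = "今天天气真好,我们一起去公园散步吧。"
input_ids = tokenizer(text)["input_ids"]

# LLaMA's SentencePiece vocabulary covers few Chinese characters directly,
# so byte fallback often emits several tokens per character, making
# len(input_ids) far larger than len(text).
print(len(text), len(input_ids))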
Thank you for the reply. Following your suggestion, I set the deepspeed.json file to:

"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
},

Then it reports:

No modifications detected for re-loaded extension module utils, skipping build step..
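For reference, a sketch of what the complete config might look like as a Python dict, usable directly via TrainingArguments(deepspeed=...) with the HuggingFace Trainer or dumped to JSON for the deepspeed launcher; the keys outside "zero_optimization" are illustrative assumptions, with "auto" values resolved by the Trainer integration:

import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True,
    },
}

# Write out as JSON for `deepspeed train.py --deepspeed deepspeed.json`.
with open("deepspeed.json", "w") as f:
    json.dump(ds_config, f, indent=2)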

This may be a failure when compiling the dynamic library (the DeepSpeed JIT extension), not a problem with deepspeed.json.
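One common remedy for a broken JIT extension build, offered here as an assumption rather than a confirmed fix for this case, is to delete the cached build directory so DeepSpeed recompiles the extension from scratch; the path below is PyTorch's default extension cache (overridable via TORCH_EXTENSIONS_DIR):

import shutil
from pathlib import Path

# Default cache where torch/DeepSpeed JIT extensions are built; removing it
# forces a clean rebuild on the next run.
cache = Path.home() / ".cache" / "torch_extensions"
if cache.exists():
    shutil.rmtree(cache)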