
Llama2 70B training errors

Open xiaopqr opened this issue 10 months ago • 3 comments

Training llama2 70B with the latest dev branch code fails with the following problem:

```
collie/collie/models/llama/model.py:203 in _forward

   200                             .permute(0, 2, 1, 4, 3) \
   201                             .reshape(batch_size, self.num_key_value_heads,
   202                                      seq_len + start_pos, -1)
❱ 203             new_layer_past = torch.stack((present_key, value.permute([0, 2, 1, 3])), dim
   204         attention_mask = attention_mask if attention_mask is not None else torch.ones((q
   205         if self.config.use_flash:
   206             output = flash_attention(query, key, value, attention_mask)

RuntimeError: stack expects each tensor to be equal size, but got [1, 8, 2048, 1024] at entry 0
and [1, 64, 2048, 128] at entry 1
```
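For reference, the mismatch can be reproduced with nothing but the shapes from the traceback. This is only an illustrative sketch (the assumed llama2-70B GQA geometry is 64 query heads, 8 KV heads, head_dim 128), not CoLLiE code:

```python
import torch

# Shapes copied from the error message.
# present_key keeps the 8 KV heads but folds the repeat groups into the last dim
# (8 groups * head_dim 128 = 1024), while value is still laid out as 64 heads of 128.
present_key = torch.empty(1, 8, 2048, 1024)   # [bsz, num_key_value_heads, seq, 8 * head_dim]
value       = torch.empty(1, 2048, 64, 128)   # [bsz, seq, num_heads, head_dim]

# Same call pattern as model.py:203 -- torch.stack requires identical shapes,
# so it raises "stack expects each tensor to be equal size".
try:
    torch.stack((present_key, value.permute([0, 2, 1, 3])), dim=0)
except RuntimeError as e:
    print(e)
```

In other words, the two halves of the KV cache disagree on layout: the key side has already been reshaped per KV head while the value side is still per query head, so torch.stack cannot combine them.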

That is one problem. A second one: with the dev branch code from a few days ago, trainer.save_model for llama2 70B (8x V100; training itself runs fine) hits a GPU OOM. Since training fits in memory, saving should not run out of it. The latest dev code probably still has this problem; it just crashes on the error above before reaching that point.

```
/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py:1553 in
_allgather_params_coalesced

  1550         allgather_params = []
  1551         for psize in partition_sizes:
  1552             tensor_size = psize * self.num_partitions
❱ 1553             flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=sel
  1554             flat_tensor.requires_grad = False
  1555             allgather_params.append(flat_tensor)
  1556

OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 7; 31.75 GiB total capacity;
29.60 GiB already allocated; 312.75 MiB free; 29.63 GiB reserved in total by PyTorch) If reserved
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
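For what it's worth, the numbers in the OOM message line up with what line 1553 tries to allocate: the gather asks for the full un-partitioned buffer on a GPU that already holds the training state. A back-of-the-envelope sketch (assumptions: 8-way ZeRO-3 partitioning, figures taken straight from the error; this only illustrates the arithmetic, it is not deepspeed code):

```python
# Values taken from the OOM message above.
num_partitions = 8                            # 8x V100 in this run
gather_mib     = 448.0                        # "Tried to allocate 448.00 MiB"
per_rank_mib   = gather_mib / num_partitions  # each rank's shard of the coalesced group
free_mib       = 312.75                       # "312.75 MiB free" on GPU 7

print(f"per-rank shard ~{per_rank_mib:.0f} MiB, "
      f"gathered buffer {gather_mib:.0f} MiB > {free_mib:.2f} MiB free")
```

So even a single coalesced gather of ~448 MiB cannot fit next to the ~29.6 GiB already reserved by training, which would explain why a run that trains fine still dies inside save_model.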

@00INDEX could you take a look?

xiaopqr · Aug 14 '23 08:08