
[Question]: aquila-7B OOM

Open calla212 opened this issue 2 years ago • 24 comments

Description

Running the aquila-7B inference example code on a 32 GB GPU reports out of memory. How much GPU memory does it need? Other 7B models run fine on this card; does the aquila model use noticeably more GPU memory?

Alternatives

No response

calla212 avatar Jun 10 '23 06:06 calla212

Same issue here:

  1. Loading the aquila-7b / aquilachat-7b model takes up to ~107 GB of host memory.
  2. After moving the model to CUDA, the process still uses ~65 GB of host memory (see the memory-logging sketch after this list).
  3. Inference on a 24 GB 3090 always triggers a CUDA OOM error.
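
For reference, a minimal sketch of how host and GPU memory at each stage could be logged (psutil and the stage labels are my own additions, not part of the FlagAI example):

import os
import psutil   # assumed available; used only to read this process's RSS
import torch

def log_memory(stage: str) -> None:
    # Resident host memory of this process, in GiB
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    # GPU memory currently allocated by PyTorch tensors, in GiB
    gpu_gib = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
    print(f"[{stage}] host RSS: {rss_gib:.1f} GiB, GPU allocated: {gpu_gib:.1f} GiB")

# e.g. call log_memory("after load") and log_memory("after .cuda()") around the loading steps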

My system information:

minerva@worker
OS: Ubuntu 20.04.3 LTS x86_64
Host: Super Server 0123456789
Kernel: 5.4.0-125-generic
Uptime: 145 days, 29 mins
Packages: 756 (dpkg), 5 (snap)
Shell: bash 5.0.17
Resolution: 1024x768
Terminal: /dev/pts/22
CPU: Intel Xeon E5-2690 v4 (56) @ 3.500GHz
GPU: NVIDIA 83:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 82:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 02:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 03:00.0 NVIDIA Corporation Device 2204
Memory: 1940MiB / 257821MiB

huntzhan avatar Jun 10 '23 06:06 huntzhan

Our engineers are looking into this issue.

ftgreat avatar Jun 10 '23 08:06 ftgreat

Fixed. [image] We will release a fixed version later; please update when it is available.

ftgreat avatar Jun 10 '23 12:06 ftgreat

Description

Running the aquila-7B inference example code on a 32 GB GPU reports out of memory. How much GPU memory does it need? Other 7B models run fine on this card; does the aquila model use noticeably more GPU memory?

Alternatives

No response

Where did you download the model files from?

hanswang73 avatar Jun 10 '23 13:06 hanswang73

I used the code from here.

calla212 avatar Jun 10 '23 15:06 calla212

Version 1.7.1 still hits this problem. The process gets killed after 32 GB of RAM is exhausted.

yinguobing avatar Jun 11 '23 13:06 yinguobing

I ran it on a 24 GB A5000; it also exits unexpectedly, without even reporting an OOM error.

hanswang73 avatar Jun 11 '23 13:06 hanswang73

Version 1.7.1 still hits this problem. The process gets killed after 32 GB of RAM is exhausted.

Could you share the script you ran?

ftgreat avatar Jun 12 '23 02:06 ftgreat

I ran it on a 24 GB A5000; it also exits unexpectedly, without even reporting an OOM error.

Is that also on version 1.7.1?

ftgreat avatar Jun 12 '23 02:06 ftgreat

I ran it on a 24 GB A5000; it also exits unexpectedly, without even reporting an OOM error.

Is that also on version 1.7.1?

I didn't check the version; I just downloaded the whole FlagAI zip from GitHub the day before yesterday.

hanswang73 avatar Jun 12 '23 02:06 hanswang73

Version 1.7.1 still hits this problem. The process gets killed after 32 GB of RAM is exhausted.

Could you share the script you ran?

The code was copied from here: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf

state_dict = "./checkpoints_in/"
model_name = 'aquila-7b' # 'aquila-33b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京在哪儿?"
text = f'{text}' 
print(f"text is {text}")
with torch.no_grad():
    out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0)
    print(f"pred is {out}")

Versions:

torch                       2.0.1+cu118          
flagai                      1.7.1                
bminf                       2.0.1                

Additionally, I replaced from torch._six import inf with from torch import inf.
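
A minimal sketch of that one-line change (the file path flagai/mpu/grads.py comes from a later comment in this thread; the exact location in your install may differ):

# flagai/mpu/grads.py
# before (fails on torch 2.x, where torch._six was removed):
#   from torch._six import inf
# after:
from torch import inf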

What gets exhausted is CPU RAM, not GPU RAM.

yinguobing avatar Jun 12 '23 02:06 yinguobing

Version 1.7.1 still hits this problem. The process gets killed after 32 GB of RAM is exhausted.

[...]

What gets exhausted is CPU RAM, not GPU RAM.

What?! How much CPU RAM does it need, then?

hanswang73 avatar Jun 12 '23 02:06 hanswang73

Fixed. [image] We will release a fixed version later; please update when it is available.

Running the step-3 inference example from here still hits the OOM problem, with 40 GB of RAM and a V100 GPU.

https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

hazy217 avatar Jun 12 '23 03:06 hazy217

WSL2 is given 50 GB of RAM and 64 GB of swap, plus 24 GB of GPU memory, and it still reports insufficient GPU memory.

ruolunhui avatar Jun 12 '23 07:06 ruolunhui

It looks like AquilaChat has the same problem. Code used: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-chat#1-%E6%8E%A8%E7%90%86inference

Reproduction environment:

python3 -m venv .env
source .env/bin/activate
pip install -i https://mirrors.cloud.tencent.com/pypi/simple flagai
pip install -i https://mirrors.cloud.tencent.com/pypi/simple bminf
# Fix the missing torch._six problem: replace from torch._six import inf with from torch import inf
vim /home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/mpu/grads.py

Could this be a dependency version problem? Could the maintainers provide a requirements.txt?

$ pip freeze
absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.2
attrs==23.1.0
bminf==2.0.1
boto3==1.21.42
botocore==1.24.46
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
colorama==0.4.6
cpm-kernels==1.0.11
datasets==2.0.0
diffusers==0.7.2
dill==0.3.6
einops==0.3.0
filelock==3.12.1
flagai==1.7.1
frozenlist==1.3.3
fsspec==2023.6.0
ftfy==6.1.1
google-auth==2.19.1
google-auth-oauthlib==0.4.6
grpcio==1.54.2
huggingface-hub==0.15.1
idna==3.4
importlib-metadata==6.6.0
jieba==0.42.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
lit==16.0.5.post0
lxml==4.9.2
Markdown==3.4.3
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
nltk==3.6.7
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
omegaconf==2.3.0
packaging==23.1
pandas==1.3.5
Pillow==9.5.0
portalocker==2.7.0
protobuf==3.19.6
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.6.5
pytz==2023.3
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
requests-oauthlib==1.3.1
responses==0.18.0
rouge-score==0.1.2
rsa==4.9
s3transfer==0.5.2
sacrebleu==2.3.1
scikit-learn==1.0.2
scipy==1.10.1
sentencepiece==0.1.96
six==1.16.0
sympy==1.12
tabulate==0.9.0
taming-transformers-rom1504==0.0.6
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.12.1
torch==2.0.1
torchmetrics==0.11.4
torchvision==0.15.2
tqdm==4.65.0
transformers==4.20.1
triton==2.0.0
typing-extensions==4.6.3
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.3.6
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

yinguobing avatar Jun 12 '23 09:06 yinguobing

I got it working: in the inference code, add a device="cuda" argument so the model is loaded directly onto the GPU (previously it was loaded to CPU first; I am not sure why). After loading, GPU memory usage is 28 GB, and 16 GB after clearing the cache. This is with the 7B model.
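
A minimal sketch of that change against the example script quoted earlier (the device keyword matches the code shared later in this thread; treat the details as an assumption if your FlagAI version differs):

from flagai.auto_model.auto_loader import AutoLoader

state_dict = "./checkpoints_in/"
model_name = 'aquila-7b'

# device="cuda" makes AutoLoader place the weights on the GPU while loading,
# instead of first materializing the full model in CPU RAM.
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device="cuda")
model = loader.get_model()
tokenizer = loader.get_tokenizer()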

hanswang73 avatar Jun 12 '23 19:06 hanswang73

Thanks. After adding device="cuda" to AutoLoader, the error is now that 24 GB of GPU memory is not enough.

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 22.89 GiB already 
allocated; 21.31 MiB free; 22.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting 
max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

LLaMA-family 7B models have no such problem.
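
As a side note, the error message above suggests max_split_size_mb for fragmentation-style OOMs; a minimal sketch of trying that (the value 128 is only an illustrative guess, and it has to be set before the first CUDA allocation):

import os

# Must be set before PyTorch initializes CUDA, i.e. before any tensor touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the allocator config is in place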

yinguobing avatar Jun 13 '23 01:06 yinguobing

You can try clearing the CUDA cache first.
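
A minimal sketch of what that looks like (drop unused Python references first, then release the cached blocks PyTorch holds):

import gc
import torch

gc.collect()               # free unreferenced Python objects that still hold GPU tensors
torch.cuda.empty_cache()   # return cached, unused GPU memory blocks to the driver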

ftgreat avatar Jun 13 '23 03:06 ftgreat

I deployed the test script as a service; GPU memory grows with every call, and OOM appears after a few calls.

safehumeng avatar Jun 13 '23 08:06 safehumeng

I deployed the test script as a service; GPU memory grows with every call, and OOM appears after a few calls.

Which FlagAI version are you using? Could you share the service code?

ftgreat avatar Jun 13 '23 08:06 ftgreat

I deployed the test script as a service; GPU memory grows with every call, and OOM appears after a few calls.

Which FlagAI version are you using? Could you share the service code?

@ftgreat I run it directly in the repo root, on this branch:

  • master 0634ab4 Merge pull request #341 from Anhforth/master

Service code:

import asyncio
import websockets
import json
import numpy as np
import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor_web import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf

os.environ['CUDA_VISIBLE_DEVICES'] = '1'

state_dict = "./checkpoints_in"
model_name = 'aquila-7b' # 'aquila-33b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

def default_dump(obj):
    """Convert numpy classes to JSON serializable objects."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

async def main_logic(websocket, path):
    data = await websocket.recv()
    request_json = json.loads(data)
    print(request_json)
    query = request_json["prompt"]
    use_stream = request_json["stream"] if "stream" in request_json else False
    max_length = request_json["maxTokens"] if "maxTokens" in request_json else 320
    top_k = request_json["topK"] if "topK" in request_json else 50
    temperature = request_json["temperature"] if "temperature" in request_json else 0.95
    top_p = request_json["topP"] if "topP" in request_json else 0.7
    do_sample = request_json["useRandom"] if "useRandom" in request_json else False
    logprobs = request_json["logprobs"] if "logprobs" in request_json else 0
    with torch.autocast("cuda"):
        g_index = 0
        for re_data in predictor.predict_generate_randomsample(query, total_max_length=max_length, top_k=top_k, top_p=top_p, temperature=temperature, prompts_tokens=None):
            print(re_data)
            # await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
            if "result" in re_data:
                re_data["result"]["index"] = g_index
            # await websocket.send(re_data.lstrip("").rstrip(""))
            if re_data["finish"]:
                await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
                break
            else:
                if use_stream and re_data["usage"]["totalTokens"] % 5 == 0 and re_data["usage"]["totalTokens"] >= 20:
                    await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
            g_index += 1
    await websocket.send("close")

async def start_server():
    server = await websockets.serve(main_logic, '0.0.0.0', 17862)
    await server.wait_closed()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(start_server())
    asyncio.get_event_loop().run_forever()

In the method referenced there, the return was changed to yield.
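
For context, a generic sketch of that kind of change (this is not FlagAI's actual predictor code, just the return-to-yield pattern that lets partial results be streamed):

from typing import Iterator

def generate_once(prompt: str) -> str:
    # one-shot version: the caller only ever sees the final text
    steps = ["partial ", "text ", "here"]   # stand-in for real decoding steps
    return "".join(steps)

def generate_streaming(prompt: str) -> Iterator[str]:
    # streaming version: yield after each step so a websocket handler
    # (like main_logic above) can forward intermediate results
    for step in ["partial ", "text ", "here"]:   # stand-in for real decoding steps
        yield step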

GPU memory grows by about 1 GB per call.

safehumeng avatar Jun 13 '23 09:06 safehumeng

no_grad

I think the predict part needs to be wrapped in no_grad; otherwise GPU memory keeps growing.
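
Two equivalent ways to do that (the names mirror the service script above; the decorator form is what the next comment ends up using):

import torch

# Option 1: wrap the call site in a no_grad context
def predict_with_context(predictor, query, max_length=320):
    with torch.no_grad():
        return list(predictor.predict_generate_randomsample(query, total_max_length=max_length))

# Option 2: decorate the method so every call inside skips autograd bookkeeping
@torch.no_grad()
def predict_with_decorator(predictor, query, max_length=320):
    return list(predictor.predict_generate_randomsample(query, total_max_length=max_length))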

ftgreat avatar Jun 13 '23 10:06 ftgreat

no_grad

I think the predict part needs to be wrapped in no_grad; otherwise GPU memory keeps growing.

OK, thanks. I added the @torch.no_grad() decorator to the method, and memory no longer grows.

safehumeng avatar Jun 13 '23 10:06 safehumeng

Thanks. After adding device="cuda" to AutoLoader, the error is now that 24 GB of GPU memory is not enough.

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 22.89 GiB already 
allocated; 21.31 MiB free; 22.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting 
max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

LLaMA-family 7B models have no such problem.

You can try flagai 1.7.2: it needs 32 GB of RAM and 16 GB of GPU memory (model plus one 2048-token sequence).

ftgreat avatar Jun 13 '23 10:06 ftgreat

Thanks. After adding device="cuda" to AutoLoader, the error is now that 24 GB of GPU memory is not enough.

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 22.89 GiB already 
allocated; 21.31 MiB free; 22.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting 
max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

LLaMA-family 7B models have no such problem.

You can try flagai 1.7.2: it needs 32 GB of RAM and 16 GB of GPU memory (model plus one 2048-token sequence).

Thanks for the reply!

After upgrading to 1.7.2, the RTX 3090 still reports a GPU OOM error.

[2023-06-14 00:46:31,934] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
******************** lm aquilachat-7b
Traceback (most recent call last):
  File "chat.py", line 10, in <module>
    loader = AutoLoader(
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/auto_model/auto_loader.py", line 216, in __init__
    self.model = getattr(LazyImport(self.model_name[0]),
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 184, in from_pretrain
    return load_local(checkpoint_path, only_download_config=only_download_config)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 116, in load_local
    model.to(device)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 23.22 GiB already allocated; 169.31 MiB free; 23.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Code used:

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.model.predictor.aquila import aquila_generate

state_dict = "./checkpoints_in"
model_name = 'aquilachat-7b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device='cuda')

model = loader.get_model()
tokenizer = loader.get_tokenizer()
cache_dir = os.path.join(state_dict, model_name)
model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京为什么是中国的首都?"

def pack_obj(text):
    obj = dict()
    obj['id'] = 'demo'

    obj['conversations'] = []
    human = dict()
    human['from'] = 'human'
    human['value'] = text
    obj['conversations'].append(human)
    # dummy bot
    bot = dict()
    bot['from'] = 'gpt'
    bot['value'] = ''
    obj['conversations'].append(bot)

    obj['instruction'] = ''

    return obj

def delete_last_bot_end_singal(convo_obj):
    conversations = convo_obj['conversations']
    assert len(conversations) > 0 and len(conversations) % 2 == 0
    assert conversations[0]['from'] == 'human'

    last_bot = conversations[len(conversations)-1]
    assert last_bot['from'] == 'gpt'

    ## from _add_speaker_and_signal
    END_SIGNAL = "\n"
    len_end_singal = len(END_SIGNAL)
    len_last_bot_value = len(last_bot['value'])
    last_bot['value'] = last_bot['value'][:len_last_bot_value-len_end_singal]
    return

def convo_tokenize(convo_obj, tokenizer):
    chat_desc = convo_obj['chat_desc']
    instruction = convo_obj['instruction']
    conversations = convo_obj['conversations']
            
    # chat_desc
    example = tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
    EOS_TOKEN = example[-1]
    example = example[:-1] # remove eos
    # instruction
    instruction = tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
    instruction = instruction[1:-1] # remove bos & eos
    example += instruction

    for conversation in conversations:
        role = conversation['from']
        content = conversation['value']
        print(f"role {role}, raw content {content}")
        content = tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
        content = content[1:-1] # remove bos & eos
        print(f"role {role}, content {content}")
        example += content
    return example

print('-'*80)
print(f"text is {text}")

from cyg_conversation import default_conversation

conv = default_conversation.copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)

tokens = tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids']
tokens = tokens[1:-1]

with torch.no_grad():
    out = aquila_generate(tokenizer, model, [text], max_gen_len:=200, top_p=0.95, prompts_tokens=[tokens])
    print(f"pred is {out}")

Also, the 1.7.2 release uploaded to PyPI does not match the 1.7.2 version on GitHub. The PyPI package raises this error:

Traceback (most recent call last):
  File "chat.py", line 4, in <module>
    from flagai.model.predictor.predictor import Predictor
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/predictor.py", line 22, in <module>
    from .aquila import aquila_generate
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/aquila.py", line 6
    def aquila_generate(
    ^
SyntaxError: duplicate argument 'top_k' in function definition

Line 14 of flagai/model/predictor/aquila.py duplicates a parameter:

def aquila_generate(
        tokenizer,
        model,
        prompts: List[str],
        max_gen_len: int,
        temperature: float = 0.8,
        top_k: int = 30,
        top_p: float = 0.95,
        top_k: int = 30, # duplicated parameter
        prompts_tokens: List[List[int]] = None,
    ) -> List[str]:
    ...
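
For reference, the corrected signature would simply drop the second top_k line:

def aquila_generate(
        tokenizer,
        model,
        prompts: List[str],
        max_gen_len: int,
        temperature: float = 0.8,
        top_k: int = 30,
        top_p: float = 0.95,
        prompts_tokens: List[List[int]] = None,
    ) -> List[str]:
    ...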

yinguobing avatar Jun 14 '23 01:06 yinguobing

After upgrading to 1.7.2, the RTX 3090 still reports a GPU OOM error.

[...]

Also, the 1.7.2 release uploaded to PyPI does not match the 1.7.2 version on GitHub; the PyPI package fails with SyntaxError: duplicate argument 'top_k' in function definition.

We will publish a release with the fix today.

ftgreat avatar Jun 14 '23 05:06 ftgreat

After updating to 1.7.3 and using FP16 precision, it runs successfully on the RTX 3090.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8               32W / 350W|  15283MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1955360      C   python3                                   15280MiB |
+---------------------------------------------------------------------------------------+

Using FP16 precision:

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    fp16=True)

yinguobing avatar Jun 14 '23 09:06 yinguobing

Closing this issue for now; please reopen if problems persist. Thanks.

ftgreat avatar Jun 19 '23 07:06 ftgreat