TigerBot
Which of the open-source models is the 100K-context version?
Which of the open-source models is the 100K-context version? Looking at the latest v6 release, it seems to support only 8k?
Both v6 models support 100k. The value in the config is only a placeholder. In our tests, 8x 40G GPUs handle a total length of 100k without issue; with 8x 80G, up to 200k is possible. You can test with the command below.
Adjust max_input_length/max_generate_length according to your actual hardware:
export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-70b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 37888 --max_generate_length 62112
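The two length flags in the command are chosen so that their sum equals the target context window. A minimal sanity-check sketch (illustrative; the 100k target is taken from the answer above, the flag names from the command):

```python
# The demo's two length flags together define the total sequence length
# the model must support; they should not exceed the context window.
max_input_length = 37888
max_generate_length = 62112
total_context = 100_000  # target context for v6 with yarn, rope_factor 8

assert max_input_length + max_generate_length <= total_context, (
    "reduce max_input_length/max_generate_length to fit the context window"
)
print(max_input_length + max_generate_length)  # 100000
```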
I only have 9x 3090 GPUs here. Is there a quantized version of v6? Can the quantized version also support 100K?
The quantized version of 70b chat v6 is here: https://huggingface.co/TigerResearch/tigerbot-70b-chat-v6-4bit-exl2
Thanks. The 4-bit quantized version can also run inference over a 100K context, right?
Yes, it can.
Which framework can serve an exllama2-quantized model through an API interface?
The quantized model was started with the following parameters: export PYTHONPATH='./' ; CUDA_VISIBLE_DEVICES=1,2,3,4 streamlit run apps/exllamav2_web_demo.py -- --model_path /data/model/tigerbot-70b-chat-v6-4bit-exl2/tigerbot --max_input_length 37888 --max_generate_length 62112
During long-context inference, the following error is reported:
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=62112) and `max_length` (=100000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Exception in thread Thread-10 (eval_generate):
Traceback (most recent call last):
File "/data/anaconda3/envs/exl2/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/anaconda3/envs/exl2/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/tigerbot/other_infer/exllamav2_hf_infer.py", line 214, in eval_generate
model.generate(**args)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
result = self._greedy_search(
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
outputs = self(
File "/data/tigerbot/other_infer/exllamav2_hf_infer.py", line 112, in __call__
self.ex_model.forward(
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/exllamav2/model.py", line 662, in forward
assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
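The assertion above fires when the number of cached tokens plus the new tokens exceeds the KV cache's `max_seq_len`. A minimal sketch of the invariant (illustrative; it assumes the cache was allocated with the model's default context, e.g. 8192, rather than the full 100k — the default value is an assumption):

```python
# Minimal reproduction of the invariant behind the AssertionError
# (illustrative; mirrors the check reported in exllamav2's model.forward).
def fits_in_cache(past_len: int, q_len: int, cache_max_seq_len: int) -> bool:
    """Return True if the total sequence still fits in the KV cache."""
    return past_len + q_len <= cache_max_seq_len

# If the cache is sized to a default context (8192 here, an assumption)
# instead of max_input_length + max_generate_length, long prompts trip it:
print(fits_in_cache(past_len=0, q_len=37888, cache_max_seq_len=8192))    # False
print(fits_in_cache(past_len=0, q_len=37888, cache_max_seq_len=100000))  # True
```

If this reading is right, the likely fix is to allocate the exllamav2 cache with a `max_seq_len` of at least max_input_length + max_generate_length (100000 here), matching what the web demo's flags promise.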
The 13B model was started with the following parameters: export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8 ; streamlit run apps/web_demo.py -- --model_path /data/model/tigerbot-13b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 10240 --max_generate_length 10240
During long-context inference, the following error is reported:
Namespace(model_path='/data/model/tigerbot-13b-chat-v6', rope_scaling='yarn', rope_factor=8.0, max_input_length=10240, max_generate_length=10240)
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
{'input_ids': tensor([[65107, 9134, 421, 3317, 29918, 1482, 29918, 517, 12360, 29952,
11070, 29947, 29896, 29929, 29906, 29897, 322, 421, 3317, 29918,
2848, 29952, 29898, 29922, 29896, 29906, 29906, 29947, 29947, 29897,
2833, 304, 505, 1063, 731, 29889, 421, 3317, 29918, 1482,
29918, 517, 12360, 29952, 674, 2125, 9399, 663, 29889, 3529,
2737, 304, 278, 5106, 363, 901, 2472, 29889, 313, 991,
597, 29882, 688, 3460, 2161, 29889, 1111, 29914, 2640, 29914,
9067, 414, 29914, 3396, 29914, 264, 29914, 3396, 29918, 13203,
29914, 726, 29918, 4738, 362, 29897, 13, 65108]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
Both `max_new_tokens` (=10240) and `max_length` (=20480) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [82,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
(the same assertion repeats for threads [65,0,0] through [95,0,0])
Exception in thread Thread-9 (generate):
Traceback (most recent call last):
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/generation/utils.py", line 1479, in generate
return self.greedy_search(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/generation/utils.py", line 2340, in greedy_search
outputs = self(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
outputs = self.model(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1070, in forward
layer_outputs = decoder_layer(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 798, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 706, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 208, in rotate_half
return torch.cat((-x2, x1), dim=-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
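The stack trace ends in `apply_rotary_pos_emb`, and the device-side assert is an "index out of bounds" in an index kernel, which suggests position ids running past a precomputed cos/sin table. A pure-Python sketch of that failure mode (the table size of 16384 is hypothetical, not taken from the 13B config):

```python
# Illustration of the suspected failure mode (pure Python, hypothetical sizes):
# apply_rotary_pos_emb indexes a precomputed cos/sin table by position id.
# If the table holds max_position_embeddings entries but the sequence runs
# past that length, the lookup goes out of bounds -- on CUDA this surfaces
# as the asynchronous "index out of bounds" device-side assert above.
max_position_embeddings = 16384  # hypothetical table size
cos_table = [0.0] * max_position_embeddings

def lookup(position_id: int) -> float:
    if not (0 <= position_id < len(cos_table)):
        raise IndexError(f"position {position_id} exceeds table of "
                         f"{len(cos_table)} entries")
    return cos_table[position_id]

lookup(10_239)          # fine: within the table
try:
    lookup(20_479)      # 10240 input + 10240 generated overruns the table
except IndexError as e:
    print(e)
```

If that is what is happening here, the yarn scaling (rope_factor 8) may not be extending the rotary table to cover max_input_length + max_generate_length = 20480, so the first decode step past the table boundary triggers the assert. Running with CUDA_LAUNCH_BLOCKING=1, as the message suggests, would pin the failure to the exact indexing call.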