TigerBot
Which of the open-source models is the 100K-context version?
Which of the open-source models is the 100K-context version? Looking at the latest v6 release, it seems to support only 8k?
Both v6 models support 100k. The value in the config is only a placeholder. In our tests, 8x 40G GPUs handle a total length of 100k without issue; with 8x 80G, up to 200k is possible. You can test with the command below.
Adjust max_input_length/max_generate_length according to your actual hardware:
export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-70b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 37888 --max_generate_length 62112
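The two length flags in the command are chosen so that their sum equals the target context window. A minimal sanity-check sketch (illustrative; the 100k target is taken from the answer above, the flag names from the command):

```python
# The demo's two length flags together define the total sequence length
# the model must support; they should not exceed the context window.
max_input_length = 37888
max_generate_length = 62112
total_context = 100_000  # target context for v6 with yarn, rope_factor 8

assert max_input_length + max_generate_length <= total_context, (
    "reduce max_input_length/max_generate_length to fit the context window"
)
print(max_input_length + max_generate_length)  # 100000
```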
I only have 9x 3090 GPUs here. Is there a quantized version of v6? Can the quantized version also support 100K?
The quantized version of 70b chat v6 is here: https://huggingface.co/TigerResearch/tigerbot-70b-chat-v6-4bit-exl2
Thanks. The 4-bit quantized version can also run inference over a 100K context, right?
Yes, it can.
Which framework can serve an exllama2-quantized model through an API interface?
The quantized model was started with the following parameters: export PYTHONPATH='./' ; CUDA_VISIBLE_DEVICES=1,2,3,4 streamlit run apps/exllamav2_web_demo.py -- --model_path /data/model/tigerbot-70b-chat-v6-4bit-exl2/tigerbot --max_input_length 37888 --max_generate_length 62112
During long-context inference, the following error is reported:
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=62112) and `max_length` (=100000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Exception in thread Thread-10 (eval_generate):
Traceback (most recent call last):
File "/data/anaconda3/envs/exl2/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/anaconda3/envs/exl2/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/tigerbot/other_infer/exllamav2_hf_infer.py", line 214, in eval_generate
model.generate(**args)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
result = self._greedy_search(
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
outputs = self(
File "/data/tigerbot/other_infer/exllamav2_hf_infer.py", line 112, in __call__
self.ex_model.forward(
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/exllamav2/model.py", line 662, in forward
assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
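The assertion above fires when the number of cached tokens plus the new tokens exceeds the KV cache's `max_seq_len`. A minimal sketch of the invariant (illustrative; it assumes the cache was allocated with the model's default context, e.g. 8192, rather than the full 100k — the default value is an assumption):

```python
# Minimal reproduction of the invariant behind the AssertionError
# (illustrative; mirrors the check reported in exllamav2's model.forward).
def fits_in_cache(past_len: int, q_len: int, cache_max_seq_len: int) -> bool:
    """Return True if the total sequence still fits in the KV cache."""
    return past_len + q_len <= cache_max_seq_len

# If the cache is sized to a default context (8192 here, an assumption)
# instead of max_input_length + max_generate_length, long prompts trip it:
print(fits_in_cache(past_len=0, q_len=37888, cache_max_seq_len=8192))    # False
print(fits_in_cache(past_len=0, q_len=37888, cache_max_seq_len=100000))  # True
```

If this reading is right, the likely fix is to allocate the exllamav2 cache with a `max_seq_len` of at least max_input_length + max_generate_length (100000 here), matching what the web demo's flags promise.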
The 13B model was started with the following parameters: export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8 ; streamlit run apps/web_demo.py -- --model_path /data/model/tigerbot-13b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 10240 --max_generate_length 10240
During long-context inference, the following error is reported:
Namespace(model_path='/data/model/tigerbot-13b-chat-v6', rope_scaling='yarn', rope_factor=8.0, max_input_length=10240, max_generate_length=10240)
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
{'input_ids': tensor([[65107, 9134, 421, 3317, 29918, 1482, 29918, 517, 12360, 29952,
11070, 29947, 29896, 29929, 29906, 29897, 322, 421, 3317, 29918,
2848, 29952, 29898, 29922, 29896, 29906, 29906, 29947, 29947, 29897,
2833, 304, 505, 1063, 731, 29889, 421, 3317, 29918, 1482,
29918, 517, 12360, 29952, 674, 2125, 9399, 663, 29889, 3529,
2737, 304, 278, 5106, 363, 901, 2472, 29889, 313, 991,
597, 29882, 688, 3460, 2161, 29889, 1111, 29914, 2640, 29914,
9067, 414, 29914, 3396, 29914, 264, 29914, 3396, 29918, 13203,
29914, 726, 29918, 4738, 362, 29897, 13, 65108]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
Both `max_new_tokens` (=10240) and `max_length` (=20480) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [82,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
(the same assertion repeats for threads [65,0,0] through [95,0,0])
Exception in thread Thread-9 (generate):
Traceback (most recent call last):
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/generation/utils.py", line 1479, in generate
return self.greedy_search(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/generation/utils.py", line 2340, in greedy_search
outputs = self(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
outputs = self.model(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1070, in forward
layer_outputs = decoder_layer(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 798, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 706, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 208, in rotate_half
return torch.cat((-x2, x1), dim=-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
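The stack trace ends in `apply_rotary_pos_emb`, and the device-side assert is an "index out of bounds" in an index kernel, which suggests position ids running past a precomputed cos/sin table. A pure-Python sketch of that failure mode (the table size of 16384 is hypothetical, not taken from the 13B config):

```python
# Illustration of the suspected failure mode (pure Python, hypothetical sizes):
# apply_rotary_pos_emb indexes a precomputed cos/sin table by position id.
# If the table holds max_position_embeddings entries but the sequence runs
# past that length, the lookup goes out of bounds -- on CUDA this surfaces
# as the asynchronous "index out of bounds" device-side assert above.
max_position_embeddings = 16384  # hypothetical table size
cos_table = [0.0] * max_position_embeddings

def lookup(position_id: int) -> float:
    if not (0 <= position_id < len(cos_table)):
        raise IndexError(f"position {position_id} exceeds table of "
                         f"{len(cos_table)} entries")
    return cos_table[position_id]

lookup(10_239)          # fine: within the table
try:
    lookup(20_479)      # 10240 input + 10240 generated overruns the table
except IndexError as e:
    print(e)
```

If that is what is happening here, the yarn scaling (rope_factor 8) may not be extending the rotary table to cover max_input_length + max_generate_length = 20480, so the first decode step past the table boundary triggers the assert. Running with CUDA_LAUNCH_BLOCKING=1, as the message suggests, would pin the failure to the exact indexing call.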