OpenLLM
bug: Not able to start tiiuae/falcon-7b
Describe the bug
Hi there,
I followed the instructions on GitHub to start tiiuae/falcon-7b:
pip install "openllm[falcon]"
openllm start falcon --model-id tiiuae/falcon-7b
Then, when I call localhost:3000 for the first time, the request times out after about 30 seconds.
The second time, it returns the output below (see the logs) and then also times out after a while.
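(For reference, the calls against localhost:3000 were plain HTTP requests; a minimal sketch of such a call, assuming the /v1/generate endpoint and payload shape from the OpenLLM README for this version:)

# Minimal sketch of a client call against the running server.
# The /v1/generate path and the payload shape are assumptions based on
# the OpenLLM README for this version; adjust if your version differs.
import requests

resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "What is the capital of France?"},
    timeout=300,  # Falcon-7B can take minutes per request on modest GPUs
)
resp.raise_for_status()
print(resp.json())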
Thanks in advance!
To reproduce
No response
Logs
openllm start falcon --model-id tiiuae/falcon-7b
Make sure to have the following dependencies available: ['einops', 'xformers', 'safetensors']
2023-06-20T16:17:11+0000 [INFO] [cli] Environ for worker 0: set CUDA_VISIBLE_DEVICES to 0
2023-06-20T16:17:11+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-06-20T16:17:12+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-06-20 16:17:15.720532: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.81s/it]
The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2023-06-20T16:19:42+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.616ms (trace=61c419d1f6ebf4618a33c76ab591ca84,span=f0534e5aa9799160,sampled=1,service.name=llm-falcon-runner)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:8] 127.0.0.1:32864 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 139.235ms (trace=61c419d1f6ebf4618a33c76ab591ca84,span=8acdd6ebfdfc0bc3,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.315ms (trace=e9158a075fc27b60719a6852115ec748,span=5147903d91a8f5cc,sampled=1,service.name=llm-falcon-runner)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:3] 127.0.0.1:32872 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 140.223ms (trace=e9158a075fc27b60719a6852115ec748,span=3340643d8fd8fbf3,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:8] 127.0.0.1:32874 (scheme=http,method=GET,path=/docs.json,type=,length=) (status=200,type=application/json,length=6855) 10.052ms (trace=78691d213c95604978c79a03e7af901e,span=b1ae35a9f48663c7,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:1] 127.0.0.1:32882 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 4.523ms (trace=994b1e4334df607b036138b15b5bd92d,span=8fe7b9288f48d7eb,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:7] 127.0.0.1:32896 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 3.804ms (trace=1cd9f72ec6c621f4dfc0378da339833f,span=f05a197843500866,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:43+0000 [INFO] [api_server:llm-falcon-service:4] 127.0.0.1:32900 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 3.348ms (trace=a2df8f149d8c22318f5bee1beef3b58b,span=38859e751c1b52fa,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:43+0000 [INFO] [api_server:llm-falcon-service:4] 127.0.0.1:32906 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 0.691ms (trace=b24a982fb330e7db790eee4e166c5fbe,span=63ab2145057b9fb1,sampled=1,service.name=llm-falcon-service)
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
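(Side note on the warning above: the transformers text-generation pipeline reports 'RWForCausalLM' as unsupported because Falcon's model class ships as custom code in the tiiuae/falcon-7b repo, loaded via trust_remote_code, so it is absent from the pipeline's built-in registry; the warning is usually benign. A minimal sketch of loading the model directly, assuming a recent transformers plus accelerate:)

# Sketch: loading Falcon's custom RWForCausalLM directly.
# trust_remote_code=True pulls the model class from the Hub repo,
# which is why the generic text-generation pipeline does not
# recognize it by name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~14 GB of weights in 16-bit precision
    trust_remote_code=True,      # required for the custom RWForCausalLM class
    device_map="auto",           # needs accelerate installed
)

inputs = tokenizer("Hello, Falcon!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))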
Environment
bentoml: 1.0.22
openllm: 0.1.8
platform: paperspace
Falcon requires a lot of resources to run, even during inference.
This has to do with the model having to compute all of the matrices in its attention layers.
On 4x A10G, the average latency I'm seeing is around 140s.
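(As a rough back-of-the-envelope, the weights alone are sizeable; a minimal sketch of the arithmetic, ignoring activations and the attention KV cache:)

# Rough estimate of Falcon-7B's weight footprint. Activations, the KV
# cache, and framework overhead add several more GB on top of this.
params = 7e9
bytes_per_param_fp16 = 2
print(f"~{params * bytes_per_param_fp16 / 1e9:.0f} GB of weights in fp16")
# -> ~14 GB, before activations and the KV cache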
Hey,
Yes, I know, but in my case I do not think it is a resource problem. It's not about the response time; it is not responding at all.
On Paperspace I created a dedicated A100 GPU instance,
with 12 CPUs and 90 GB of memory, and no other services running on it.
That's why I thought the problem is this line in the logs:
The model 'RWForCausalLM' is not supported for text-generation. Supported models are [...] (full list in the logs above).
Got it, I will take a look.
I was only able to run Falcon on a g5.24xlarge, which has 96 GB of GPU memory and 384 GB of RAM :)
Wow, okay, I will give it a try. Thanks!