
bug: Not able to start tiiuae/falcon-7b

Open Kizy625 opened this issue 2 years ago • 4 comments

Describe the bug

Hi there,

I followed the instructions on GitHub to start tiiuae/falcon-7b:

pip install "openllm[falcon]"
openllm start falcon --model-id tiiuae/falcon-7b

Then, when calling localhost:3000 for the first time, it times out after 30 seconds.

The second time, it returns the output shown in the logs below and then times out again after a while.
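
For reference, I'm calling the server roughly like this (a sketch; the /v1/generate endpoint and payload shape are what I gathered from the OpenLLM 0.1.x README, so treat them as assumptions):

```python
import requests

# Query the running BentoServer; the /v1/generate path and the
# payload shape are assumptions based on the OpenLLM 0.1.x README.
response = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "What is a falcon?", "llm_config": {"max_new_tokens": 64}},
    timeout=300,  # generous client timeout, since generation can be slow
)
print(response.json())
```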

Thanks in advance!

To reproduce

No response

Logs

openllm start falcon --model-id tiiuae/falcon-7b
Make sure to have the following dependencies available: ['einops', 'xformers', 'safetensors']
2023-06-20T16:17:11+0000 [INFO] [cli] Environ for worker 0: set CUDA_VISIBLE_DEVICES to 0
2023-06-20T16:17:11+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-06-20T16:17:12+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-06-20 16:17:15.720532: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.81s/it]
The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2023-06-20T16:19:42+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.616ms (trace=61c419d1f6ebf4618a33c76ab591ca84,span=f0534e5aa9799160,sampled=1,service.name=llm-falcon-runner)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:8] 127.0.0.1:32864 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 139.235ms (trace=61c419d1f6ebf4618a33c76ab591ca84,span=8acdd6ebfdfc0bc3,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.315ms (trace=e9158a075fc27b60719a6852115ec748,span=5147903d91a8f5cc,sampled=1,service.name=llm-falcon-runner)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:3] 127.0.0.1:32872 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 140.223ms (trace=e9158a075fc27b60719a6852115ec748,span=3340643d8fd8fbf3,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:8] 127.0.0.1:32874 (scheme=http,method=GET,path=/docs.json,type=,length=) (status=200,type=application/json,length=6855) 10.052ms (trace=78691d213c95604978c79a03e7af901e,span=b1ae35a9f48663c7,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:1] 127.0.0.1:32882 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 4.523ms (trace=994b1e4334df607b036138b15b5bd92d,span=8fe7b9288f48d7eb,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:7] 127.0.0.1:32896 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 3.804ms (trace=1cd9f72ec6c621f4dfc0378da339833f,span=f05a197843500866,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:43+0000 [INFO] [api_server:llm-falcon-service:4] 127.0.0.1:32900 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 3.348ms (trace=a2df8f149d8c22318f5bee1beef3b58b,span=38859e751c1b52fa,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:43+0000 [INFO] [api_server:llm-falcon-service:4] 127.0.0.1:32906 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 0.691ms (trace=b24a982fb330e7db790eee4e166c5fbe,span=63ab2145057b9fb1,sampled=1,service.name=llm-falcon-service)
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.

Environment

bentoml: 1.0.22
openllm: 0.1.8
platform: paperspace

Kizy625 avatar Jun 20 '23 16:06 Kizy625

Falcon requires a lot of resources to run, even during inference.

This has to do with the model having to compute all of the matrices in its attention layers at every generation step.

On 4 A10Gs, the average latency I'm seeing is around 140s.
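
For a rough sense of the weight memory alone (a back-of-the-envelope sketch that ignores activations and the KV cache):

```python
# Rough weight-memory estimate for a 7B-parameter model.
params = 7e9
gib = 2**30
print(f"fp32: {params * 4 / gib:.1f} GiB")  # ~26 GiB
print(f"bf16: {params * 2 / gib:.1f} GiB")  # ~13 GiB
```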

aarnphm avatar Jun 20 '23 17:06 aarnphm

Hey,

Yes, I know, but in my case I do not think it is a resource problem. It's not about the response time; it is not responding at all.

On Paperspace I created a dedicated A100 GPU instance with 12 CPUs and 90 GB of memory, without any additional services running on it.

That's why I thought the problem is this warning in the logs:

The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
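
From what I can tell, this warning appears because falcon-7b ships its modeling code as custom remote code, so its RWForCausalLM class is not in transformers' built-in text-generation registry; generation can still work despite the warning. A minimal sketch of loading the model directly with transformers (assuming a recent transformers release, the accelerate package, and enough GPU memory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# trust_remote_code=True lets transformers import the custom
# RWForCausalLM class shipped inside the tiiuae/falcon-7b repo.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # needs accelerate installed
)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("What is a falcon?", max_new_tokens=32)[0]["generated_text"])
```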

Kizy625 avatar Jun 20 '23 18:06 Kizy625

Got it, I will take a look.

aarnphm avatar Jun 20 '23 22:06 aarnphm

I was only able to run Falcon on a g5.24xlarge, which has 96 GB of GPU memory and 384 GB of RAM :)

aarnphm avatar Jun 26 '23 21:06 aarnphm

Wow, okay, I will give it a try. Thanks!

Kizy625 avatar Jun 29 '23 08:06 Kizy625