Falcon 40B: too slow and random answers
Hi, when I deployed the Falcon 40B model on the Basaran WebUI I got:

- random answers; for example, when I said "hi", I got: "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..."
- very slow inference, even though I was using a RunPod server costing $10 per hour with 4 A100 80GB GPUs
I tried customizing the settings like this:
```python
kwargs = {
    "local_files_only": local_files_only,
    "trust_remote_code": trust_remote_code,
    "torch_dtype": torch.bfloat16,
    "device_map": "auto",
}
```
I used half precision, but nothing changed.
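For reference, here is roughly what a plain transformers load with these kwargs looks like (a minimal sketch; the model path, prompt, and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"  # illustrative; a local path works too

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,      # Falcon ships custom modeling code
    torch_dtype=torch.bfloat16,  # half precision, as in the kwargs above
    device_map="auto",           # spread layers across the available GPUs
)

inputs = tokenizer("hi", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```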
Any idea how I could handle this issue?
Thanks (and congrats on this beautiful WebUI!)
Hi @ArnaudHureaux! I haven't used RunPod before, and there could be multiple reasons for this issue:
- Falcon models seem to require PyTorch 2.0, while Basaran's images use version 1.1.4.
- The custom settings you mentioned are not in the format accepted by Basaran. The options supported by Basaran can be found in the Dockerfile.
We will attempt to reproduce the issue using tiiuae/falcon-40b on our local machine later.
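To make the expected format concrete: Basaran takes its options as environment variables rather than a Python kwargs dict. A minimal sketch of how such options are read at startup (the variable names here follow the `MODEL_*` convention used in the Dockerfile, but verify the exact names there):

```python
import os

# Illustrative only: Basaran-style options arrive as environment variables,
# e.g. set via `docker run -e ...` or in the Dockerfile, not as kwargs.
def env_flag(name: str, default: str = "false") -> bool:
    return os.environ.get(name, default).lower() in ("1", "true", "yes")

MODEL = os.environ.get("MODEL", "")  # model name or local path
MODEL_TRUST_REMOTE_CODE = env_flag("MODEL_TRUST_REMOTE_CODE")
MODEL_HALF_PRECISION = env_flag("MODEL_HALF_PRECISION")
MODEL_LOAD_IN_8BIT = env_flag("MODEL_LOAD_IN_8BIT")
```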
Hi, the Falcon model is pretty bad with very short prompts like "hi", "hello", etc.; you often get exactly that kind of output. If you ask a longer question, you will get a proper answer. It's not related to the Basaran implementation.
In my case, the answer was totally random, with messages like "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..." ??
I didn't see this behavior with other implementations, so I think the problem comes from this implementation?
Using only Hugging Face, I got the same result with `load_in_8bit=True` (see the sketch after the examples below):
Question: hi
Answer: (4).
'I don't think I'll ever be able to forget you.'
or:
Question: hi
Answer:
It seems that the error is caused by a problem with your `onRequestSuccess` function. Specifically, the error message mentions that the function is returning an undefined value, and it seems like the `onRequestSuccess` is trying to return before the response from the server has been read.
To fix this error, you can try modifying the `onRequestSuccess` function to use Promises instead of callbacks. Instead of using `callback` to pass data to the next function, you can use `return` statements to return Promises.
Here's an example:
```javascript
function onRequestSuccess(response) {
    return new Promise(function(resolve, reject) {
        console.log(response);
        // Parse JSON
        if (response.data && response.data.hasOwnProperty('success')) {
            resolve(response);
        } else {
            reject(response);
        }
    });
}

function onError(error) {
    console.log('Error:', error);
}

function sendRequest() {
    var requestData = { "username": "myusername", "password": "
```
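For reference, the `load_in_8bit=True` test mentioned above corresponds roughly to this call (a sketch; it additionally requires the bitsandbytes package, and the model path is illustrative):

```python
from transformers import AutoModelForCausalLM

# Same as the bfloat16 load sketched earlier, but with 8-bit quantization
# (bitsandbytes) instead of torch_dtype=torch.bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",     # illustrative
    trust_remote_code=True,  # Falcon uses custom modeling code
    load_in_8bit=True,       # requires: pip install bitsandbytes accelerate
    device_map="auto",
)
```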
If it helps, I updated the Dockerfile to use nvcr.io/nvidia/pytorch:23.05-py3 and was able to load the model referenced above and run inference. I can confirm that it runs slowly for me, but I attribute that to the model not loading on the GPUs, even in 8-bit mode, which should be able to run with just 45GB of RAM per https://huggingface.co/blog/falcon#fine-tuning-with-peft. I don't see the same quality issues as @ArnaudHureaux. To me that looks like a tokenizer problem, maybe?
Inference with a short prompt:
```
~/basaran$ curl -w 'Total: %{time_total}s\n' http://127.0.0.1/v1/completions -H 'Content-Type: application/json' -d '{ "prompt": ["once upon a time,"], "echo": true }'
{"id":"cmpl-8ba3deeed1b838469f2a0d6e","object":"text_completion","created":1686333906,"model":"/models/falcon-40b","choices":[{"text":"once upon a time, spring 2011 was going to be the beginning of the bandeau bikini.","index":0,"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"completion_tokens":21,"total_tokens":26}}
Total: 274.909453s
```
GPUs when loaded:
```
| 22% 25C P8 14W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 25C P8 15W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 26C P8 15W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 26C P8 15W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 24C P8 13W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 25C P8 15W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 23C P8 14W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 23C P8 15W / 250W | 6MiB / 12288MiB | 0% Default |
| 22% 24C P8 14W / 250W | 6MiB / 12288MiB | 0% Default |
```
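A quick way to confirm from Python where the weights actually ended up (a generic transformers/PyTorch check, not Basaran-specific; pass it the object returned by `from_pretrained`):

```python
import torch

def report_device_placement(model) -> None:
    """Print where a transformers model's weights actually live."""
    print("CUDA available:", torch.cuda.is_available())
    # Set by accelerate when the model was loaded with device_map="auto":
    print("hf_device_map:", getattr(model, "hf_device_map", None))
    # Devices that actually hold the weights; {'cpu'} would explain the slowness:
    print("parameter devices:", {str(p.device) for p in model.parameters()})
```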
Am I the only one who encountered an error saying I need to install the `einops` library when trying to deploy the Falcon 40B model? This library is not part of the requirements.txt of version 0.19.0.
`einops` is only used by the Falcon model; it should not be a requirement for the package.
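For illustration, this is the usual pattern for such optional dependencies (a sketch, not Basaran's actual code): the import is only triggered by models whose remote code needs it, so it surfaces as an `ImportError` at model load time rather than being pinned in requirements.txt:

```python
# Sketch of an optional-dependency guard (illustrative).
try:
    import einops  # noqa: F401 -- needed only by models like Falcon
except ImportError:
    einops = None

def check_optional_deps(model_id: str) -> None:
    # Fail early with a clear message instead of deep inside remote model code.
    if "falcon" in model_id.lower() and einops is None:
        raise ImportError(
            f"{model_id} requires 'einops'; install it with: pip install einops"
        )
```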