
Falcon 40B : too slow and random answers

Open ArnaudHureaux opened this issue 1 year ago • 7 comments

Hi, when I deployed the Falcon 40B model on the Basaran WebUI I got:

  • random answers; for example, when I said "hi", I got: "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..."

  • very slow inference, even though I was using a RunPod server costing $10 per hour with 4 A100 80GB GPUs

I tried to customize the settings like this:
kwargs = { "local_files_only": local_files_only, "trust_remote_code": trust_remote_code, "torch_dtype": torch.bfloat16, "device_map": "auto" }

  • i.e. I used half precision, but nothing changed.
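For reference, here is a sketch of how a kwargs dict like that would be used when loading Falcon directly with transformers, outside of Basaran (the from_pretrained call is shown commented out because it needs the 40B weights and a lot of GPU memory; note that Basaran itself is configured through environment variables, not a Python kwargs dict):

```python
# Sketch of loading Falcon directly via transformers (assumption:
# this bypasses Basaran entirely; Basaran reads its options from
# environment variables rather than a Python kwargs dict).
kwargs = {
    "local_files_only": False,
    "trust_remote_code": True,   # Falcon ships custom modeling code
    "torch_dtype": "bfloat16",   # half precision, halves memory use
    "device_map": "auto",        # shard layers across available GPUs
}

# Needs roughly 80GB of GPU memory in bfloat16:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", **kwargs)
```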

Any idea how I could fix this issue?

Thanks (and congrats on this beautiful WebUI!)

ArnaudHureaux avatar Jun 06 '23 12:06 ArnaudHureaux

Hi @ArnaudHureaux! I haven't used RunPod before, and there could be multiple reasons for this issue:

  1. Falcon models seem to require PyTorch 2.0, while Basaran's images use version 1.13.1.

  2. The custom settings you mentioned are not in the format accepted by Basaran. Options supported by Basaran can be found in the Dockerfile.
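For illustration, a typical invocation passes those options as environment variables rather than Python kwargs. This is a hypothetical sketch: the variable names here are my best recollection of Basaran's Dockerfile, so double-check them against the Dockerfile of the version you are running:

```shell
# Hypothetical sketch: configure Basaran via env vars instead of a
# Python kwargs dict (verify variable names in the Dockerfile).
docker run -p 80:80 --gpus all \
  -e MODEL=tiiuae/falcon-40b \
  -e MODEL_TRUST_REMOTE_CODE=true \
  -e MODEL_HALF_PRECISION=true \
  hyperonym/basaran
```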

We will attempt to reproduce the issue using tiiuae/falcon-40b on our local machine later.

peakji avatar Jun 07 '23 03:06 peakji

Hi, the Falcon model is pretty bad with very short prompts like "hi", "hello", etc. You often get exactly that kind of output. If you ask a longer question, you will get a proper answer; it's not related to the Basaran implementation.

jgcb00 avatar Jun 08 '23 08:06 jgcb00

In my case, the answer was totally random, with messages like "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..." ??

I didn't see this behavior with other implementations, so I think the problem comes from this implementation?

ArnaudHureaux avatar Jun 08 '23 09:06 ArnaudHureaux

Using only Hugging Face, I got the same result with load_in_8bit=True:

Question: hi
Answer:  (4).

'I don't think I'll ever be able to forget you.'

or :

Question: hi
Answer:  
It seems that the error is caused by a problem with your `onRequestSuccess` function. Specifically, the error message mentions that the function is returning an undefined value, and it seems like the `onRequestSuccess` is trying to return before the response from the server has been read.

To fix this error, you can try modifying the `onRequestSuccess` function to use Promises instead of callbacks. Instead of using `callback` to pass data to the next function, you can use `return` statements to return Promises.

Here's an example:


function onRequestSuccess(response) {
   return new Promise(function(resolve, reject) {
      console.log(response);

      // Parse JSON
      if (response.data && response.data.hasOwnProperty('success')) {
         resolve(response);
      } else {
         reject(response);
      }
   });
}

function onError(error) {
   console.log('Error:', error);
}

function sendRequest() {
  var requestData = { "username": "myusername", "password": "

jgcb00 avatar Jun 08 '23 09:06 jgcb00

If it helps, I updated the Dockerfile to use nvcr.io/nvidia/pytorch:23.05-py3 and was able to load the model referenced above and run inference. I can confirm that it runs slowly for me, but I attribute that to the model not loading onto the GPUs, even in 8-bit mode, which should be able to run with just 45GB of RAM per https://huggingface.co/blog/falcon#fine-tuning-with-peft. I don't see the same quality issues as @ArnaudHureaux. To me that looks like a tokenizer problem, maybe?
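One way to confirm where the weights actually ended up: when a model is loaded through accelerate with device_map="auto", it exposes an hf_device_map attribute mapping each submodule to a device. This is a sketch with the load call commented out (it needs the 40B weights on disk); the small helper and the example device map are hypothetical:

```python
# Sketch: after loading with device_map="auto" (via accelerate),
# the model records where each submodule was placed. Entries that
# say "cpu" or "disk" mean those layers never reached a GPU, which
# would match the ~6MiB-per-card nvidia-smi readout above.
#
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "tiiuae/falcon-40b",
#     trust_remote_code=True,
#     device_map="auto",
#     load_in_8bit=True,
# )
# print(model.hf_device_map)

def gpus_used(device_map: dict) -> set:
    """Return the set of GPU indices actually holding layers.

    GPU placements appear as integer device indices; "cpu" and
    "disk" placements are strings and are filtered out.
    """
    return {d for d in device_map.values() if isinstance(d, int)}

# Hypothetical device map where everything fell back to CPU:
print(gpus_used({"transformer.word_embeddings": "cpu", "lm_head": "cpu"}))  # set()
```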

Inference with a short prompt:

~/basaran$ curl -w 'Total: %{time_total}s\n' http://127.0.0.1/v1/completions -H 'Content-Type: application/json' -d '{ "prompt": ["once upon a time,"], "echo": true }'

{"id":"cmpl-8ba3deeed1b838469f2a0d6e","object":"text_completion","created":1686333906,"model":"/models/falcon-40b","choices":[{"text":"once upon a time, spring 2011 was going to be the beginning of the bandeau bikini.","index":0,"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"completion_tokens":21,"total_tokens":26}}
Total: 274.909453s

GPUs when loaded:

| 22%   25C    P8               14W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   25C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   26C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   26C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   24C    P8               13W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   25C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   23C    P8               14W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   23C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   24C    P8               14W / 250W|      6MiB / 12288MiB |      0%      Default |

0xDigest avatar Jun 09 '23 18:06 0xDigest

Am I the only one who encountered an error saying I need to install the "einops" library when trying to deploy the Falcon 40B model? This library is not part of requirements.txt in version 0.19.0.

Louanes1 avatar Jun 20 '23 10:06 Louanes1

einops is only used by the Falcon model; it should not be a requirement for the package.
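Until it's added to the requirements, a minimal workaround (assuming you're inside the environment or container running Basaran) is to install it manually:

```shell
pip install einops
```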

jgcb00 avatar Jun 20 '23 11:06 jgcb00