Faster inference: PagedAttention from vLLM
I'm getting great qualitative results from Falcon fine-tuned with Adapter v2.
Inference is faster than what I get with Hugging Face PEFT and LoRA, but it's still too slow to scale up.
Could the ideas or code from vLLM's PagedAttention (https://github.com/vllm-project/vllm) be used to really speed up inference with parallel sampling and larger batch sizes?
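For context, here is a minimal sketch of the idea behind PagedAttention, not vLLM's actual implementation: the KV cache lives in fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so parallel samples forked from the same prompt can share the prompt's blocks instead of duplicating them. The `Sequence` class, block sizes, and pool layout below are all illustrative assumptions.

```python
# Sketch of a paged KV cache (illustrative only, not vLLM's code).
import torch

BLOCK_SIZE = 16          # tokens per physical KV block (assumed value)
NUM_BLOCKS = 256         # size of the physical block pool (assumed value)
N_HEADS, HEAD_DIM = 8, 64

# One physical pool for keys and one for values, shared by all sequences.
k_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, N_HEADS, HEAD_DIM)
v_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, N_HEADS, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    def __init__(self):
        self.block_table: list[int] = []  # logical block -> physical block
        self.length = 0                   # tokens written so far

    def append_kv(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Write one token's K/V into the pool, allocating a block on demand."""
        if self.length % BLOCK_SIZE == 0:          # current block is full
            self.block_table.append(free_blocks.pop())
        block = self.block_table[-1]
        slot = self.length % BLOCK_SIZE
        k_pool[block, slot] = k
        v_pool[block, slot] = v
        self.length += 1

    def fork(self) -> "Sequence":
        """Parallel sampling: the child shares the parent's physical blocks.
        (Copy-on-write of the last, partially filled block is omitted here.)"""
        child = Sequence()
        child.block_table = list(self.block_table)  # shared physical blocks
        child.length = self.length
        return child

    def gather_kv(self) -> tuple[torch.Tensor, torch.Tensor]:
        """Reassemble contiguous K/V for attention. vLLM instead reads the
        blocks directly inside a custom attention kernel."""
        ks = torch.cat([k_pool[b] for b in self.block_table])[: self.length]
        vs = torch.cat([v_pool[b] for b in self.block_table])[: self.length]
        return ks, vs

# Usage: cache a 20-token prompt, then fork 4 parallel samples that all
# reuse the same prompt blocks - no KV duplication, so larger batches fit.
prompt = Sequence()
for _ in range(20):
    prompt.append_kv(torch.randn(N_HEADS, HEAD_DIM),
                     torch.randn(N_HEADS, HEAD_DIM))
samples = [prompt.fork() for _ in range(4)]
print(len(free_blocks), samples[0].block_table)  # blocks shared across forks
```

The sketch only covers the memory-management side; much of vLLM's speedup comes from reading the blocks directly in a fused attention kernel rather than gathering them into a contiguous tensor, plus copy-on-write when forked sequences diverge.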