Woosuk Kwon
We need to provide clean abstractions and interfaces so that users can easily plug in their custom models.
We should provide a clean abstraction and interface so that users can plug in their custom tokenizers just as easily.
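A minimal sketch of what such a plug-in interface could look like; the `CustomTokenizer` protocol and the toy implementation below are hypothetical, not an existing API:

```python
from typing import Dict, List, Protocol


class CustomTokenizer(Protocol):
    """Hypothetical interface a user-supplied tokenizer could implement."""

    def encode(self, text: str) -> List[int]: ...
    def decode(self, token_ids: List[int]) -> str: ...


class WhitespaceTokenizer:
    """Toy implementation: assigns one token id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}
        self.inv: Dict[int, str] = {}

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                idx = len(self.vocab)
                self.vocab[word] = idx
                self.inv[idx] = word
            ids.append(self.vocab[word])
        return ids

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self.inv[i] for i in token_ids)
```

With a structural protocol like this, any object exposing `encode`/`decode` would satisfy the interface without subclassing anything.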
We are currently using the `-O2` flag when compiling our CUDA kernels. We need to investigate whether and how changing it to `-O3` affects system performance and compilation time.
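One way to experiment is to rewrite the nvcc flag list before passing it to the build (e.g. via `extra_compile_args={"nvcc": ...}` in a `CUDAExtension`). The helper below is a hypothetical sketch, not part of the build scripts:

```python
from typing import List


def set_opt_level(nvcc_flags: List[str], level: str) -> List[str]:
    """Return a copy of the flag list with any existing -O<n> flag replaced.

    Hypothetical helper for A/B-testing optimization levels; drops flags of
    the exact form -O<digits> and appends the requested level at the end.
    """
    out = [f for f in nvcc_flags if not (f.startswith("-O") and f[2:].isdigit())]
    out.append(level)
    return out
```

For example, `set_opt_level(["-O2", "--use_fast_math"], "-O3")` yields `["--use_fast_math", "-O3"]`, which could then be timed against the `-O2` build.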
Only works for Falcon-7B for now. The Falcon-40B model generates garbage outputs and needs debugging.
Should be merged after #273
Closes #61 This PR adds the BLOOM model and modifies the paged attention kernel to support ALiBi bias.
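For reference, a small pure-Python sketch of the ALiBi bias the kernel needs to add to the attention scores (slopes follow the standard `2^(-8i/n)` schedule for a power-of-two head count; the function names here are illustrative, not the kernel's API):

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head ALiBi slopes: 2^(-8i/n) for i = 1..n (power-of-2 n only)."""
    assert num_heads & (num_heads - 1) == 0, "sketch handles power-of-2 head counts"
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Additive bias for one head: slope * (key_pos - query_pos).

    Zero on the diagonal, increasingly negative for keys further in the past;
    in the kernel this is added to q·k scores before the softmax.
    """
    return [[slope * (k - q) for k in range(seq_len)] for q in range(seq_len)]
```

In the paged attention kernel the same bias has to be computed from the logical token positions, since keys for one sequence are scattered across physical blocks.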
Closes #218 and #332 Should be merged after #61
While playing with it, I stumbled upon strange behavior that might indicate an issue when beam search is used. I started the server with: `python3 -m...
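To make the report reproducible, here is the kind of request body I mean; the field names assume the server accepts `SamplingParams`-style options (`n`, `use_beam_search`, etc.), so treat this as a sketch rather than the exact request I sent:

```python
import json

# Hypothetical request payload for triggering beam search on the server.
payload = {
    "prompt": "The capital of France is",
    "n": 4,                    # number of beams / returned sequences
    "use_beam_search": True,   # enable beam search instead of sampling
    "temperature": 0.0,        # beam search scores greedily
    "max_tokens": 32,
}
body = json.dumps(payload)
```

Posting this body to the server's generate endpoint should be enough to reproduce the behavior.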