Woosuk Kwon

Results 65 issues of Woosuk Kwon

We need to provide clean abstractions and interfaces so that users can easily plug in their custom models.

We should provide a clean abstraction and interface so that users can use their custom tokenizer very easily.
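One possible shape for such an interface, as a minimal sketch only: the `Tokenizer` protocol, `WhitespaceTokenizer`, and `round_trip` below are hypothetical names made up for illustration and are not taken from the project's actual API. The idea is that engine code depends only on `encode`/`decode`, so any object providing both methods can be plugged in.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical minimal interface a custom tokenizer would implement."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, token_ids: list[int]) -> str: ...


class WhitespaceTokenizer:
    """Toy tokenizer that assigns an id to each new whitespace-separated word."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}
        self._words: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self._vocab:
                # First time this word is seen: give it the next free id.
                self._vocab[word] = len(self._words)
                self._words.append(word)
            ids.append(self._vocab[word])
        return ids

    def decode(self, token_ids: list[int]) -> str:
        return " ".join(self._words[i] for i in token_ids)


def round_trip(tokenizer: Tokenizer, text: str) -> str:
    """Caller code that relies only on the protocol, not on a concrete class."""
    return tokenizer.decode(tokenizer.encode(text))
```

Because the protocol is structural, `WhitespaceTokenizer` does not need to inherit from `Tokenizer`; it only has to provide the two methods.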

We currently compile our CUDA kernels with the `-O2` flag. We need to investigate whether and how changing it to `-O3` affects runtime performance and compilation time.

performance

Only works for Falcon-7B for now. The Falcon-40B model generates garbage outputs. Needs debugging.

Should be merged after #273

Closes #61. This PR adds the BLOOM model and modifies the paged attention kernel to support ALiBi bias.
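For context, ALiBi replaces positional embeddings with a per-head linear penalty added to the attention scores before the softmax. The following is a minimal pure-Python sketch of that bias, not the PR's kernel code: the function names are hypothetical, and the slope formula shown only covers power-of-two head counts.

```python
import math


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    base = 2.0 ** (-8.0 / num_heads)
    return [base ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix for one head.

    Entry [i][j] is -slope * (i - j) when key j is at or before query i,
    and -inf for future positions (the causal mask).
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

The matrix returned by `alibi_bias` is added to the query-key scores of the corresponding head, so more distant keys receive a larger penalty.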

Closes #218 and #332. Should be merged after #61.

While playing with it, I've stumbled upon strange behavior that might indicate an issue when beam search is used. I've started the server with: `python3 -m...