Zijun Zhou
- Update the gRPC proto to support:
  - Request: token IDs or text (one of).
  - Response: token IDs, text, or both.
- Currently, request with either token id...
- Customer request: We use multiple languages for our clients and cannot implement detokenization in each one. We need server-side detokenization support.
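The request/response shape described above can be sketched in plain Python. This is only an illustration of the "one of" semantics and of server-side detokenization, not the actual JetStream proto or API: `DecodeRequest`, `DecodeResponse`, `detokenize`, `build_response`, and `TOY_VOCAB` are all hypothetical names, and the vocabulary is a toy stand-in for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical toy vocabulary; a real server would use the model's tokenizer.
TOY_VOCAB = {0: "hello", 1: " world", 2: "!"}


@dataclass
class DecodeRequest:
    """Hypothetical request: exactly one of `text` or `token_ids` is set."""
    text: Optional[str] = None
    token_ids: Optional[List[int]] = None

    def __post_init__(self) -> None:
        # Mimic protobuf `oneof` semantics: reject zero or two populated fields.
        if (self.text is None) == (self.token_ids is None):
            raise ValueError("set exactly one of `text` or `token_ids`")


@dataclass
class DecodeResponse:
    """Hypothetical response: token ids, text, or both."""
    token_ids: List[int] = field(default_factory=list)
    text: Optional[str] = None


def detokenize(token_ids: List[int]) -> str:
    """Server-side detokenization using the toy vocabulary (assumption)."""
    return "".join(TOY_VOCAB[t] for t in token_ids)


def build_response(token_ids: List[int], want_ids: bool, want_text: bool) -> DecodeResponse:
    """Fill in only the fields the client asked for."""
    return DecodeResponse(
        token_ids=list(token_ids) if want_ids else [],
        text=detokenize(token_ids) if want_text else None,
    )
```

With this shape, a client in any language can ask for text only and never needs its own tokenizer, which is the point of the customer request.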
Can we refactor the imports so that MaxText can be used as Python modules? It's pretty hard for developers to use or build on top of it.
- Blocking inference development with JetStream....
- Optimized TPU duty cycle (largest gap < 4 ms)
- Optimized TTFT: dispatch prefill tasks ASAP without unnecessary blocking on the CPU, keep backpressure to enforce insert ASAP, return first token...
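The backpressure idea mentioned above can be illustrated with a bounded queue between a prefill stage and an insert stage. This is a minimal sketch under stated assumptions, not the actual orchestration code: `run_pipeline`, `max_pending`, and the string stand-ins for prefill results are hypothetical, and real prefill/insert steps would run on the accelerator.

```python
import queue
import threading
from typing import List, Optional


def run_pipeline(prompts: List[str], max_pending: int = 2) -> List[str]:
    """Run a two-stage prefill -> insert pipeline with a bounded hand-off queue."""
    # Bounded queue: put() blocks once `max_pending` prefill results are
    # waiting, so prefill cannot run arbitrarily far ahead of insert.
    pending: "queue.Queue[Optional[str]]" = queue.Queue(maxsize=max_pending)
    inserted: List[str] = []

    def prefill_worker() -> None:
        for prompt in prompts:
            result = f"prefill({prompt})"  # stand-in for the real prefill step
            pending.put(result)  # blocks when the queue is full (backpressure)
        pending.put(None)  # sentinel: no more work

    def insert_worker() -> None:
        while True:
            item = pending.get()
            if item is None:
                break
            inserted.append(item)  # stand-in for the real insert step

    threads = [
        threading.Thread(target=prefill_worker),
        threading.Thread(target=insert_worker),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return inserted
```

A small `max_pending` keeps completed prefills from piling up, so each one is inserted soon after it finishes; a larger value lets prefill run further ahead at the cost of more buffered results.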