
feat: support phi3.5 moe

drbh opened this pull request 1 year ago • 2 comments

This is a work in progress PR to add support for microsoft/Phi-3.5-MoE-instruct

TODO

  • [x] add phi 3.5 to ModelType
  • [x] load weights into memory
  • [x] prefer moe over mlp in layers
  • [x] enable long/short rope scaling
  • [x] validate scaling logic
  • [x] ensure layer logic is correct
  • [x] ensure no regressions on existing phi models
  • [x] identify issue with allocating graphs
  • [x] refactor/cleanup/add tests

drbh avatar Aug 30 '24 15:08 drbh


This PR adds support for Phi 3.5 MoE and improves the chat endpoint to assume greedy generation unless the temperature is explicitly set by the user in the request (this helps align the output with Phi's reference implementation).
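The greedy-by-default behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual TGI implementation; `resolve_sampling` and its signature are hypothetical.

```python
# Hypothetical sketch of "greedy unless temperature is set".
# Not the real TGI code; names here are illustrative only.

def resolve_sampling(temperature=None):
    """Return (do_sample, temperature).

    If the request leaves temperature unset, fall back to greedy
    decoding (argmax at each step); otherwise honor the user's value
    and enable sampling.
    """
    if temperature is None:
        return False, 1.0          # greedy: deterministic argmax decoding
    return True, temperature        # user explicitly opted into sampling

print(resolve_sampling())      # greedy by default
print(resolve_sampling(0.7))   # sampling when temperature is given
```

The point of the design is that a chat request with no sampling parameters now produces deterministic output, which makes comparisons against a reference implementation reproducible.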

Start phi3.5moe

text-generation-launcher \
  --model-id microsoft/Phi-3.5-MoE-instruct \
  --num-shard 4 \
  --cuda-graphs 1,2 \
  --trust-remote-code

send a request

curl 127.0.0.1:3000/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Hello who are you?",
    "parameters": {
      "max_new_tokens": 20
    }
  }'

response

{
    "generated_text": " I'm an artificial intelligence developed by Microsoft to assist with a variety of tasks and provide information."
}
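For comparison, a request that opts back into sampling would set the temperature explicitly. A small sketch of the same `/generate` payload as the curl example above, with sampling parameters added (the `temperature` and `do_sample` fields are standard TGI generate parameters; the exact values here are illustrative):

```python
import json

# Same payload as the curl example above, plus explicit sampling
# parameters to override the greedy-by-default behavior this PR adds.
payload = {
    "inputs": "Hello who are you?",
    "parameters": {
        "max_new_tokens": 20,
        "temperature": 0.7,   # illustrative value; enables sampling
        "do_sample": True,
    },
}
print(json.dumps(payload, indent=2))
```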

drbh avatar Sep 03 '24 17:09 drbh