
Adding GPT-NeoX

aflah02 opened this issue 1 year ago · 7 comments

I followed the instructions here to add GPT-NeoX support, which would bring support for the Pythia model family and other models with a similar architecture.

Reference: https://github.com/sgl-project/sglang/issues/157#issue-2122338478

FIXED (Keeping Logs for Future Reference): I was able to debug most errors, but I'm stuck on this particular one, which happens once I start sending requests to the endpoint (so I assume the model loads correctly):

INFO:     Started server process [1344199]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30013 (Press CTRL+C to quit)
INFO:     127.0.0.1:37998 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 17. #remaining_req: 0. #running_req: 0
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py", line 165, in exposed_step
    self.forward_step()
  File "/NS/llm-1/nobackup/afkhan/anaconda3/envs/sglangfact/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py", line 180, in forward_step
    self.forward_fill_batch(new_batch)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py", line 369, in forward_fill_batch
    logits, (logprobs, normalized_logprobs) = self.model_runner.forward(
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_runner.py", line 486, in forward
    return self.forward_extend(**kwargs)
  File "/NS/llm-1/nobackup/afkhan/anaconda3/envs/sglangfact/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_runner.py", line 391, in forward_extend
    return self.model.forward(input_ids, input_metadata.positions, input_metadata)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/models/gpt_neox.py", line 236, in forward
    return self.logits_processor(
  File "/NS/llm-1/nobackup/afkhan/anaconda3/envs/sglangfact/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/NS/llm-1/nobackup/afkhan/anaconda3/envs/sglangfact/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/layers/logits_processor.py", line 32, in forward
    last_logits = torch.matmul(last_hidden, weight.T)
  File "/NS/llm-1/nobackup/afkhan/anaconda3/envs/sglangfact/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'ParallelLMHead' object has no attribute 'T'

/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py:204: UserWarning: Warning: available_size=714944, max_total_num_token=714961
KV cache pool leak detected!
  warnings.warn(

Any idea what might be going wrong? The error seems related to the LogitsProcessor, which I'm not very familiar with. I tried to copy the logic from the Llama implementation.
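
For reference, the traceback shows that weight inside logits_processor.py is the ParallelLMHead module itself, not its weight tensor; an nn.Module has no .T, which is exactly the AttributeError raised. A minimal sketch of the likely fix, assuming the head is named embed_out as in the vLLM GPT-NeoX port and the logits processor takes the same arguments as in the Llama model file:

# in sglang/srt/models/gpt_neox.py, GPTNeoXForCausalLM.forward():
# pass the weight *tensor*, not the ParallelLMHead module
return self.logits_processor(
    input_ids, hidden_states, self.embed_out.weight, input_metadata
)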

aflah02 avatar Feb 07 '24 21:02 aflah02

Update: I just noticed the missing piece and fixed it, which resolves the old error, but now I get a new one:

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py", line 165, in exposed_step
    self.forward_step()
  File "/NS/llm-1/nobackup/afkhan/anaconda3/envs/sglangfact/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py", line 192, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/model_rpc.py", line 429, in forward_decode_batch
    next_token_ids, next_token_probs = batch.sample(logits)
  File "/NS/llm-1/work/afkhan/sglang/python/sglang/srt/managers/router/infer_batch.py", line 452, in sample
    sampled_index = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

aflah02 avatar Feb 07 '24 21:02 aflah02

I did some further testing. It runs perfectly for https://github.com/aflah02/sglang/blob/main/examples/usage/choices_logprob.py but fails for https://github.com/aflah02/sglang/blob/main/examples/quick_start/srt_example_chat.py with the error above. It seems the issue might be elsewhere.
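
A plausible explanation for why only one example works: choices_logprob.py only scores fixed candidate strings via prefill logprobs, while the chat example has to sample new tokens in the decode loop, so NaN/Inf logits would only surface in the torch.multinomial call. A quick check (hypothetical helper; probs_sort is the tensor named in the traceback) that could be dropped in just before the failing line in infer_batch.py:

import torch

def report_bad_probs(probs_sort):
    # flag rows containing NaN, Inf, or negative probabilities
    bad = torch.isnan(probs_sort) | torch.isinf(probs_sort) | (probs_sort < 0)
    if bad.any():
        rows = bad.any(dim=-1).nonzero().flatten().tolist()
        print(f"invalid probability rows before multinomial: {rows}")

If this fires, the NaNs are already present in the logits, which points at the model forward pass rather than the sampler.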

aflah02 avatar Feb 07 '24 21:02 aflah02

@merrymercy Any thoughts? I'm not sure why one tutorial works while the other doesn't.

aflah02 avatar Feb 09 '24 20:02 aflah02

@aflah02

  1. Can you try this tutorial? https://github.com/sgl-project/sglang/blob/main/examples/quick_start/srt_example_complete.py The chat example may not work properly because of the chat template: the default template is Vicuna, and GPT-NeoX has not been tuned on that template.
  2. Can you add more print statements to see where the NaN comes from (see the sketch after this list)? Does it appear in the early transformer layers, or only in the last layer?
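
A minimal sketch of the per-layer check suggested in point 2; the module and argument names are illustrative and should be adapted to the actual GPT-NeoX port:

import torch

def check_nan(name, t):
    # report the first activation in which NaN/Inf values appear
    if torch.isnan(t).any() or torch.isinf(t).any():
        print(f"NaN/Inf detected in {name}")

# e.g. inside GPTNeoXModel.forward, after each decoder layer:
# for i, layer in enumerate(self.layers):
#     hidden_states = layer(positions, hidden_states, input_metadata)
#     check_nan(f"layer {i}", hidden_states)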

merrymercy avatar Feb 11 '24 14:02 merrymercy

@merrymercy For part 2, it seems the error mainly occurs in the last layer or the last few layers. Some of the logs for the chat example are here: logs.txt. The first tutorial also gives a similar error.

aflah02 avatar Feb 20 '24 18:02 aflah02

@merrymercy Any thoughts on what might be going wrong here? I'm not sure whether a chat template alone can cause such breaking errors.

aflah02 avatar Mar 01 '24 18:03 aflah02

@aflah02 I have no idea. I typically debug these kinds of weird bugs by comparing intermediate tensors layer by layer between the sglang and huggingface/vllm implementations, similar to your print statements.
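
A minimal sketch of that layer-by-layer comparison, using the HuggingFace Pythia checkpoint as the reference; the model name and tolerances are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-1b"
tok = AutoTokenizer.from_pretrained(name)
ref = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)

inputs = tok("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    out = ref(**inputs, output_hidden_states=True)

# out.hidden_states holds the embedding output followed by each layer's output;
# dump the corresponding sglang activations and compare, e.g.:
# torch.testing.assert_close(sglang_hidden[i], out.hidden_states[i + 1], rtol=1e-2, atol=1e-2)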

merrymercy avatar Mar 11 '24 02:03 merrymercy

@merrymercy Sorry for being inactive; life got really busy over the past few months. I don't have the bandwidth to take this on right now, so feel free to work on it if you'd like.

aflah02 avatar Jun 12 '24 14:06 aflah02

I will close this for now.

merrymercy avatar Jun 12 '24 22:06 merrymercy