[FEAT] Add transformers backend support
## What does this PR do?
Fixes #5471. This PR adds support for transformers as a backend. Thanks @Isotr0py and @XuehaiPan for the help!
Additional features added:
- [x] TP
- [x] `trust_remote_code=True`
- [x] quantization (tested with torchao and `tp_size=2`)
- [ ] LoRAs (maybe in a follow-up PR)
## Examples
```python
import sglang as sgl

if __name__ == "__main__":
    tp_size = 1
    # quantization also works with TP -> set torchao_config="int4wo-128"
    # set impl="transformers" to force the transformers implementation
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.2-1B-Instruct",
        impl="transformers",
        tp_size=tp_size,
    )
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 512}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
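The quantized TP setup mentioned in the comments above would look roughly like this; a sketch based on the `torchao_config="int4wo-128"` and `tp_size=2` values this PR was tested with:

```python
import sglang as sgl

# Sketch: same engine, but with tensor parallelism and torchao int4
# weight-only quantization enabled (values taken from the comments above).
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    impl="transformers",
    tp_size=2,
    torchao_config="int4wo-128",
)
```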
@jhinpan will evaluate this with the OPT model.
cc @zhaochenyang20 @SunMarc I tried to rerun the testing script and am facing this issue:
```
Writing report to /tmp/mmlu_meta-llama_Llama-3.2-1B-Instruct.html
{'other': 0.1875, 'other:std': 0.3903123748998999, 'score:std': 0.40232478717449166, 'stem': 0.2727272727272727, 'stem:std': 0.4453617714151233, 'humanities': 0.17391304347826086, 'humanities:std': 0.3790346907426672, 'social_sciences': 0.21428571428571427, 'social_sciences:std': 0.41032590332414487, 'score': 0.203125}
Writing results to /tmp/mmlu_meta-llama_Llama-3.2-1B-Instruct.json
Total latency: 9.732 s
Score: 0.203
E
======================================================================
ERROR: test_ci_models (__main__.TestTransformersFallbackEngine)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1925, in retry
    return fn()
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1118, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: Not all ROUGE-L scores are greater than rouge_l_tolerance=1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1117, in _callTestMethod
    retry(
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1928, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_mmlu (__main__.TestTransformersFallbackTorchAO)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1925, in retry
    return fn()
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1118, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: 0.203125 not greater than or equal to 0.25

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1117, in _callTestMethod
    retry(
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1928, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 5 tests in 208.941s

FAILED (errors=2)
```
I also tried to create a new testing script specifically to test the OPT model, only to face the issue below:
```
[2025-05-01 21:52:18] Received sigquit from a child process. It usually means the child failed.
EINFO 05-01 21:52:32 __init__.py:190] Automatically detected platform cuda.
INFO 05-01 21:52:32 __init__.py:190] Automatically detected platform cuda.
[2025-05-01 21:52:44] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2216, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 268, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 64, in __init__
    self.worker = TpModelWorker(
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 82, in __init__
    self.model_runner = ModelRunner(
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 190, in __init__
    self.initialize(min_per_gpu_memory)
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 205, in initialize
    self.load_model()
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 458, in load_model
    self.model = get_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 372, in load_model
    model = _initialize_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 148, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
  File "/workspace/sglang/python/sglang/srt/model_loader/utils.py", line 101, in get_model_architecture
    architectures = resolve_transformers_arch(model_config, architectures)
  File "/workspace/sglang/python/sglang/srt/model_loader/utils.py", line 61, in resolve_transformers_arch
    raise ValueError(
ValueError: The Transformers implementation of OPTForCausalLM is not compatible with vLLM.
[2025-05-01 21:52:44] Received sigquit from a child process. It usually means the child failed.
[1] 21287 killed python3 test/srt/models/test_transformers_opt_models.py
```
Hey @jhinpan, this is expected for the OPT model since it doesn't yet support the attention abstraction that we have for newer models. You can test, for example, the granite, glm, gpt-neox, or helium models. The compatible models have `_supports_attention_backend=True`.
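If it helps, here is a minimal sketch of how that flag could be checked without downloading weights; it relies on transformers' private auto-class mapping, and the granite model ID is only an illustration:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Resolve the model class for a checkpoint via the (private) auto mapping,
# then check whether it opts into the pluggable attention backends.
config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-2b-instruct")
model_cls = AutoModelForCausalLM._model_mapping[type(config)]
print(model_cls.__name__, getattr(model_cls, "_supports_attention_backend", False))
```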
Thanks for taking a look, @jhinpan!
> I tried to rerun the testing script and am facing this issue:
Yeah, indeed. I fixed the bound for the MMLU test since I was able to reproduce it. I must have set it without testing it beforehand.
> Not all ROUGE-L scores are greater than rouge_l_tolerance=1
As for the ROUGE scores, I'm not able to reproduce this. Can you check the ROUGE-L scores? Maybe we need to adjust `rouge_l_tolerance` a bit. I'm getting `rouge_l_scores=[1.0, 1.0, 1.0, 1.0, 1.0]`.
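For reference, a minimal sketch of the kind of per-sample ROUGE-L comparison at play here, using the `rouge_score` package; the reference/prediction strings and the exact comparison are illustrative, not the test's actual data:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Placeholder strings; the real test compares model outputs to references.
references = ["The capital of France is Paris."]
predictions = ["The capital of France is Paris."]

rouge_l_tolerance = 1.0
rouge_l_scores = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(rouge_l_scores)  # identical strings score 1.0
assert all(s >= rouge_l_tolerance for s in rouge_l_scores)
```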
I'm doing these tests on an A100, by the way. Which hardware are you using? Is it possible to run these tests on the CI?
Sure, I was testing on an H100. I will check the ROUGE-L scores and ask Chenyang about running the CI tests as well. cc @zhaochenyang20
@zhaochenyang20 I rebased on main; can you rerun the CI?
@SunMarc please fix the conflicts.
@SunMarc Do not update your PR. Leave it alone. Let me do it. Thanks!
Sounds good, thanks!
cc @XuehaiPan @zhyncs. Can you both take a final look and check whether it can be merged? LGTM right now.
Nice work!
Thanks for the review, @CatherineSue! I've resolved your comments.
@CatherineSue hey Chang, if you approve it, we can merge it after the CI.
@SunMarc Great, no need to rebase. Let me rerun all the CI.
Thanks everyone for your help on this PR!