[FEAT] Add transformers backend support
## What does this PR do?
Fixes #5471. This PR adds support for transformers as a backend. Thanks @Isotr0py and @XuehaiPan for the help!
Additional features added:
- [x] TP
- [x] `trust_remote_code=True`
- [x] quantization (tested with torchao and `tp_size=2`)
- [ ] LoRAs (maybe in a follow-up PR)
## Examples
```python
import sglang as sgl

if __name__ == "__main__":
    tp_size = 1
    # quantization also works with TP -> set torchao_config="int4wo-128"
    # set impl="transformers" to force the transformers implementation
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.2-1B-Instruct",
        impl="transformers",
        tp_size=tp_size,
    )
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 512}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
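The quantized TP setup mentioned in the comments above would look roughly like this; a sketch based on the `torchao_config="int4wo-128"` and `tp_size=2` values this PR was tested with:

```python
import sglang as sgl

# Sketch: same engine, but with tensor parallelism and torchao int4
# weight-only quantization enabled (values taken from the comments above).
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    impl="transformers",
    tp_size=2,
    torchao_config="int4wo-128",
)
```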
@jhinpan will evaluate this with the OPT model.
cc @zhaochenyang20 @SunMarc I tried to rerun the testing script and am facing this issue:
```
Writing report to /tmp/mmlu_meta-llama_Llama-3.2-1B-Instruct.html
{'other': 0.1875, 'other:std': 0.3903123748998999, 'score:std': 0.40232478717449166, 'stem': 0.2727272727272727, 'stem:std': 0.4453617714151233, 'humanities': 0.17391304347826086, 'humanities:std': 0.3790346907426672, 'social_sciences': 0.21428571428571427, 'social_sciences:std': 0.41032590332414487, 'score': 0.203125}
Writing results to /tmp/mmlu_meta-llama_Llama-3.2-1B-Instruct.json
Total latency: 9.732 s
Score: 0.203
E
======================================================================
ERROR: test_ci_models (__main__.TestTransformersFallbackEngine)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1925, in retry
    return fn()
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1118, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: Not all ROUGE-L scores are greater than rouge_l_tolerance=1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1117, in _callTestMethod
    retry(
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1928, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_mmlu (__main__.TestTransformersFallbackTorchAO)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1925, in retry
    return fn()
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1118, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: 0.203125 not greater than or equal to 0.25

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/test/test_utils.py", line 1117, in _callTestMethod
    retry(
  File "/workspace/sglang/python/sglang/srt/utils.py", line 1928, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 5 tests in 208.941s

FAILED (errors=2)
```
I also tried to create a new testing script specifically to test the OPT model, only to face the issue below:
```
[2025-05-01 21:52:18] Received sigquit from a child process. It usually means the child failed.
EINFO 05-01 21:52:32 __init__.py:190] Automatically detected platform cuda.
INFO 05-01 21:52:32 __init__.py:190] Automatically detected platform cuda.
[2025-05-01 21:52:44] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2216, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 268, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 64, in __init__
    self.worker = TpModelWorker(
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 82, in __init__
    self.model_runner = ModelRunner(
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 190, in __init__
    self.initialize(min_per_gpu_memory)
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 205, in initialize
    self.load_model()
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 458, in load_model
    self.model = get_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 372, in load_model
    model = _initialize_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 148, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
  File "/workspace/sglang/python/sglang/srt/model_loader/utils.py", line 101, in get_model_architecture
    architectures = resolve_transformers_arch(model_config, architectures)
  File "/workspace/sglang/python/sglang/srt/model_loader/utils.py", line 61, in resolve_transformers_arch
    raise ValueError(
ValueError: The Transformers implementation of OPTForCausalLM is not compatible with vLLM.
[2025-05-01 21:52:44] Received sigquit from a child process. It usually means the child failed.
[1] 21287 killed python3 test/srt/models/test_transformers_opt_models.py
```
Hey @jhinpan, this is expected for the OPT model since it doesn't yet support the attention abstraction that we have for newer models. You can test, for example, the granite, glm, gpt-neox, or helium models. The compatible models have `_supports_attention_backend=True`.
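If it helps, here is a minimal sketch of how that flag could be checked without downloading weights; it relies on transformers' private auto-class mapping, and the granite model ID is only an illustration:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Resolve the model class for a checkpoint via the (private) auto mapping,
# then check whether it opts into the pluggable attention backends.
config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-2b-instruct")
model_cls = AutoModelForCausalLM._model_mapping[type(config)]
print(model_cls.__name__, getattr(model_cls, "_supports_attention_backend", False))
```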
Thanks for taking a look, @jhinpan!
> I tried to rerun the testing script and am facing this issue:
Yeah, indeed. I fixed the bound for the MMLU test since I was able to reproduce it. I must have set it without testing it beforehand.
> Not all ROUGE-L scores are greater than rouge_l_tolerance=1
As for the ROUGE scores, I'm not able to reproduce this. Can you check the ROUGE-L scores? Maybe we need to adjust `rouge_l_tolerance` a bit. I'm getting `rouge_l_scores=[1.0, 1.0, 1.0, 1.0, 1.0]`.
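For reference, a minimal sketch of the kind of per-sample ROUGE-L comparison at play here, using the `rouge_score` package; the reference/prediction strings and the exact comparison are illustrative, not the test's actual data:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Placeholder strings; the real test compares model outputs to references.
references = ["The capital of France is Paris."]
predictions = ["The capital of France is Paris."]

rouge_l_tolerance = 1.0
rouge_l_scores = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(rouge_l_scores)  # identical strings score 1.0
assert all(s >= rouge_l_tolerance for s in rouge_l_scores)
```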
I'm doing these tests on an A100, by the way. Which hardware are you using? Is it possible to run these tests on the CI?
Sure, I was testing on an H100. I will check the ROUGE-L scores and ask Chenyang about running the CI tests as well. cc @zhaochenyang20
@zhaochenyang20 I rebased on main; can you rerun the CI?
@SunMarc please fix the conflicts.
@SunMarc Do not update your PR. Leave it alone. Let me do it. Thanks!
Sounds good, thanks!
cc @XuehaiPan @zhyncs. Can you both take a final look and check whether it can be merged? LGTM right now.
Nice work!
Thanks for the review, @CatherineSue! I've resolved your comments.
@CatherineSue hey Chang, if you approve it, we can merge it after the CI.
@SunMarc Great, no need to rebase. Let me rerun all the CI.
Thanks everyone for your help on this PR!