
[Frontend] Dynamic RoPE scaling


In #555, @WoosukKwon removed the ability to specify RoPE scaling dynamically, with the comment:

As we discussed offline, I removed rope_scaling from ModelConfig and EngineArgs. Now rope_scaling is always read from the model's config.json.

I don't understand why this feature was removed, so this PR brings it back. Specifying RoPE scaling on the command line is very useful: otherwise we have to manually modify a config.json that may be managed by Hugging Face, so each model has to be forked just to set a different RoPE scaling. A sketch of what the restored override could look like appears after the fix reference below.

FIX #4334
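
For illustration, here is a minimal sketch of the restored override from the Python API, assuming this PR re-adds a `rope_scaling` engine argument that accepts the same dict that would otherwise live in config.json (the exact name and shape are this PR's proposal, not settled API):

```python
# Minimal sketch, assuming this PR restores a `rope_scaling` engine argument.
# Without the override, rope_scaling is always read from the model's config.json.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    # The same dict that would otherwise have to be hard-coded in a forked config.json.
    rope_scaling={"type": "linear", "factor": 2.0},
    max_model_len=16384,  # native 8192-token context * factor 2.0
)
```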

RoPE scaling makes it possible to use a longer context than the model was trained with, without further fine-tuning.
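
As a refresher on why this works: linear ("position interpolation") scaling divides each position index by the factor before computing the rotary angles, so a 16k-token sequence is rotated as if it were 8k tokens long at training time. A small sketch of the standard RoPE math (not vLLM code):

```python
def rope_angle(pos: int, pair: int, head_dim: int,
               base: float = 10000.0, factor: float = 1.0) -> float:
    """Rotation angle for dimension pair `pair` at position `pos`.

    Linear RoPE scaling divides the position by `factor`, squeezing a
    longer sequence into the angle range seen during training.
    """
    inv_freq = base ** (-2.0 * pair / head_dim)
    return (pos / factor) * inv_freq

# With factor=2.0, position 16000 gets the same rotation that
# position 8000 received without scaling.
assert rope_angle(16000, 0, 128, factor=2.0) == rope_angle(8000, 0, 128)
```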

Summarization using meta-llama/Meta-Llama-3-8B-Instruct (native context: 8192 tokens) with type = linear and factor = 2.0:

ChatCompletion(id='cmpl-11ec2bc71a734a1eba13eaca5e0c54d9', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The article provides information about CUDA, a parallel computing platform and programming model created by NVIDIA. It allows software developers to use graphics processing units (GPUs) for general-purpose computing, rather than just graphics processing graphics. CUDA is a proprietary, but has been adopted by many developers. The article covers CUDA's history, features, technical specifications, capabilities, and usage.\n\nCUDA's features and capabilities include:\n\n* Parallel programming model: CUDA allows for parallel processing of tasks on multiple threads and blocks\n* Memory management: CUDA's memory hierarchy includes registers, shared memory, and global memory\n* Execution: CUDA code can be executed on multiple GPUs and CPUs\n* Interoperability: CUDA supports various programming languages, including C, C++, Fortran, and Python\n* APIs: CUDA has a low-level API (CUDA Driver) and high-level (CUDA) API, with libraries and runtime\n\nCUDA has several advantages, including:\n\n* Higher performance, especially for computationally intensive tasks\n* Larger memory bandwidth and storage\n* Better power efficiency\n\nCUDA's compute capabilities include:\n\n* Wide range of compute capabilities (1.0 to 11.1)\n* Different memory types (shared, global, texture, and constant)\n* Instructions (e.g., ALU, INT, FP, and FP16)\n* Number of AL lanes, texture mapping units, and scheduling\n* Warp size and block sizes\n\nCUDA is used for various applications, including:\n\n* Accelerated rendering, video, encryption, and decryption\nBioinformatics, medical simulations, machine learning\nNeural network, proteins, cryptography, and more\n\nCUDA competes with other GPU computing stacks, such as Intel's OneAPI and AMD's ROCm.", role='assistant', function_call=None, tool_calls=None), stop_reason=128009)], created=1715044970, model='meta-llama/Meta-Llama-3-8B-Instruct', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=344, prompt_tokens=12648, total_tokens=12992))
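
Note that prompt_tokens=12648 in the usage stats already exceeds the native 8192-token context; linear scaling with factor 2.0 doubles the effective window to 16384 tokens. For completeness, a hedged sketch of the kind of client call that produces output like the above, against vLLM's OpenAI-compatible server (the base URL, article file name, and prompt wording are assumptions):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; base URL and port are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical source text; the original run used an article on CUDA.
article = open("cuda_article.txt").read()

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user",
               "content": f"Summarize the following article:\n\n{article}"}],
)
print(completion)
```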

sasha0552 · May 07 '24 01:05