
[Bugfix] Fix LongRoPE bug

Open garg-amit opened this pull request 1 year ago • 21 comments

This PR addresses the issue by using the number of original input tokens to decide between the short and long scaling factors, so that the switch cannot happen in the middle of a single generation request: if the number of original input tokens is less than original_max_position_embeddings, the short factors are used; otherwise, the long factors are used.
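
A minimal sketch of that selection logic, using hypothetical names (the actual vLLM implementation in the LongRoPE rotary embedding code differs in detail):

# Illustrative sketch only, not the actual vLLM code: choose the LongRoPE
# rescale factors once per request from the original input length, so the
# short/long switch cannot flip partway through one generation request.
def select_longrope_factors(num_orig_input_tokens: int,
                            original_max_position_embeddings: int,
                            short_factor: list[float],
                            long_factor: list[float]) -> list[float]:
    # The prompt fits inside the original context window: use short factors.
    if num_orig_input_tokens < original_max_position_embeddings:
        return short_factor
    # The prompt already exceeds the original window: use long factors for
    # the whole request, no matter how many tokens are generated afterwards.
    return long_factor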

FIX #6438

Models tested with this PR:

  • microsoft/Phi-3.5-mini-instruct
  • microsoft/Phi-3-mini-4k-instruct
  • microsoft/Phi-3-small-8k-instruct
  • microsoft/Phi-3-small-128k-instruct
  • microsoft/Phi-3.5-MoE-instruct
  • OpenVINO/Phi-3-mini-128k-instruct-int4-ov

Backends tested:

  • XFORMERS
  • FLASHINFER
  • FLASH_ATTN
  • OPENVINO
  • ROCM_FLASH
PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain code quality and improves the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following prefixes:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class, etc.).
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.).
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behavior of vLLM. This helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and may not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@robertgshaw2-neuralmagic @mgoin

garg-amit avatar Sep 07 '24 00:09 garg-amit

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] avatar Sep 07 '24 00:09 github-actions[bot]

@garg-amit is this PR ready for review?

robertgshaw2-redhat avatar Sep 08 '24 12:09 robertgshaw2-redhat

@robertgshaw2-neuralmagic yes, the PR is ready for review now

garg-amit avatar Sep 09 '24 20:09 garg-amit

@garg-amit what would need to happen to trigger this feature? Is there anything inside the LM Eval harness that we could use to trigger and confirm this is working correctly in our CI?

robertgshaw2-redhat avatar Sep 09 '24 20:09 robertgshaw2-redhat

@robertgshaw2-neuralmagic This fix addresses the scenario where the input length is shorter than original_max_position_embeddings but the combined input and output length exceeds it. I'm not sure which evals would trigger this case, but to check parity in other scenarios, MMLU and GSM8K can be used for short-context evals and Qasper for long-context evals.
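
For reference, a minimal sketch of how this boundary case can be reproduced locally with vLLM's offline API (assumes a local GPU; Phi-3.5-mini-instruct uses original_max_position_embeddings = 4096, and the placeholder prompt needs to be replaced with real long text, e.g. the example below):

# Illustrative repro sketch, not a CI test: the prompt stays just under the
# original 4K window, and max_tokens pushes the total sequence past it.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3.5-mini-instruct")
long_prompt = "..."  # placeholder for a roughly 4K-token prompt
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate([long_prompt], params)
print(outputs[0].outputs[0].text)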

garg-amit avatar Sep 10 '24 05:09 garg-amit

@robertgshaw2-neuralmagic we performed some local testing to trigger the above scenario with Phi-3.5-mini-instruct. Here are the outputs before and after the fix.

Prompt:

Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.[2] Its best-known software products are the Windows line of operating systems, the Microsoft 365 suite of productivity applications, the Azure cloud computing platform and the Edge web browser. Its flagship hardware products are the Xbox video game consoles and the Microsoft Surface lineup of touchscreen personal computers. Microsoft ranked No. 14 in the 2022 Fortune 500 rankings of the largest United States corporations by total revenue;[3] and it was the world's largest software maker by revenue in 2022 according to Forbes Global 2000. It is considered one of the Big Five American information technology companies, alongside Alphabet (parent company of Google), Amazon, Apple, and Meta (parent company of Facebook).
...
[62] The European Union imposed another fine of €899 million ($1.4 billion) for Microsoft's lack of compliance with the March 2004 judgment on February 27, 2008, saying that the company charged rivals unreasonable prices for key information about its workgroup and backoffice servers.[63] Microsoft stated that it was in compliance and that "these fines are about the past issues that have been resolved".[64] 2007 also saw the creation of a multi-core unit at Microsoft, following the steps of server companies such as Sun and IBM.

Gates retired

Prompt length: 4093

Output before the fix:

 as chief software architectingessis.
as.




00orx








0.0omist.0aringing, with.
.
.





...


of0X.

game,


of000000000000s.0000000000000000000.
0X


s
0000X00XX



X0s of
of
of0orofof.0000xxon, of000xx.,
000


time,00o
XXXor0s.
0000XXXXX00

sXor.or.0s000s.
of.0on.
of.





000s
000oXNT.

00



such.
onon.

XorXXs00

After the fix:

 as chief software architect in June 2006, and as chairman of the board in February 2008.[65] In 2008, Microsoft released Windows Phone 7, a new mobile operating system for smartphones, and the Xbox 360 S, a revised version of the Xbox 360.[66][67] The company also released the Zune HD, a digital media player with video playback, and the Surface, a tablet computer.[68]

In 2009, Microsoft released Windows 7, a new version of Windows that was generally well received.[69] The company also released the Xbox 360 Wireless Gaming Receiver, a wireless adapter for the Xbox 360.[70]

In 2010, Microsoft released the Kinect, a motion-sensing input device for the Xbox 360 and Xbox One.[71] The company also released the Surface 2, a tablet computer with a new design and improved hardware.[72]

In 2011, Microsoft released Windows 8, a new version of Windows that was generally well received.[73] The company also released the Xbox 360 E, a revised version of the Xbox 360.[74]
...
Microsoft released

garg-amit avatar Sep 11 '24 16:09 garg-amit

@robertgshaw2-neuralmagic can I please get a review?

garg-amit avatar Sep 17 '24 05:09 garg-amit

@robertgshaw2-neuralmagic can I please get a review?

Yes. Will get this in for v0.6.2

robertgshaw2-redhat avatar Sep 17 '24 12:09 robertgshaw2-redhat

@robertgshaw2-neuralmagic can I please get a review?

Yes. Will get this in for v0.6.2

@garg-amit - I was waiting for tests to be green to merge. Are you planning to address them?

robertgshaw2-redhat avatar Sep 17 '24 12:09 robertgshaw2-redhat

@robertgshaw2-neuralmagic I’ve fixed most of the test cases. The remaining test cases seem unrelated to the current PR. Please let me know if there is anything else that needs to be fixed.

garg-amit avatar Sep 18 '24 07:09 garg-amit

@robertgshaw2-neuralmagic I’ve fixed most of the test cases. The remaining test cases seem unrelated to the current PR. Please let me know if there is anything else that needs to be fixed.

Thanks. Will review and merge tomorrow

robertgshaw2-redhat avatar Sep 20 '24 02:09 robertgshaw2-redhat

AMD CI has a broken node. Okay to ignore the issues

robertgshaw2-redhat avatar Sep 21 '24 21:09 robertgshaw2-redhat

@robertgshaw2-neuralmagic The PR is ready for another look.

garg-amit avatar Sep 25 '24 16:09 garg-amit

@robertgshaw2-neuralmagic Please let me know if there is anything else that needs to be addressed.

garg-amit avatar Oct 01 '24 05:10 garg-amit

Was OOO last week. My apologies. Will review today

robertgshaw2-redhat avatar Oct 01 '24 09:10 robertgshaw2-redhat

@robertgshaw2-neuralmagic ping

garg-amit avatar Oct 06 '24 20:10 garg-amit

@zhuohan123 @youkaichao @WoosukKwon could you help review this PR?

garg-amit avatar Oct 15 '24 16:10 garg-amit

Im so sorry @garg-amit - this slipped off my radar. Will give final review tonight

robertgshaw2-redhat avatar Oct 15 '24 16:10 robertgshaw2-redhat

LGTM! I pushed to your branch to get the CI green. Sorry again about the delay.

robertgshaw2-redhat avatar Oct 17 '24 16:10 robertgshaw2-redhat

No worries! :) Thank you for reviewing, @robertgshaw2-neuralmagic!

garg-amit avatar Oct 17 '24 17:10 garg-amit

Hey @garg-amit, it looks like this is causing legitimate issues in the CI:

[2024-10-17T19:15:39Z]         # Attention metadata.
[2024-10-17T19:15:39Z] >       attn_metadata = self.attn_metadata_builder.build(
[2024-10-17T19:15:39Z]             seq_lens, query_lens, num_orig_input_tokens_list,
[2024-10-17T19:15:39Z]             cuda_graph_pad_size, batch_size)
[2024-10-17T19:15:39Z] E       TypeError: PlaceholderAttentionMetadataBuilder.build() takes 5 positional arguments but 6 were given

robertgshaw2-redhat avatar Oct 18 '24 15:10 robertgshaw2-redhat

@robertgshaw2-neuralmagic I’ve fixed the above test case. However, the remaining test cases appear to be unrelated to the current PR. I tried to resolve them, but I couldn’t. Please let me know if those need to be resolved. I would appreciate your help.
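
For context, the error above is a plain signature mismatch: the call site now passes num_orig_input_tokens_list, so every attention metadata builder's build() must accept the extra argument. A hedged sketch of that kind of fix (illustrative only; the real vLLM class is more involved):

class PlaceholderAttentionMetadataBuilder:
    # Illustrative sketch, not the actual vLLM class: accept the new
    # num_orig_input_tokens_list argument so the shared call site works;
    # builders that do not need it can simply ignore it.
    def build(self, seq_lens, query_lens, num_orig_input_tokens_list,
              cuda_graph_pad_size, batch_size):
        ...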

garg-amit avatar Oct 28 '24 06:10 garg-amit

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @garg-amit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 06 '24 19:11 mergify[bot]

@robertgshaw2-neuralmagic when are you planning to merge this PR?

garg-amit avatar Nov 10 '24 23:11 garg-amit

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @garg-amit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 19 '24 03:11 mergify[bot]

@garg-amit @robertgshaw2-neuralmagic Is there an update on this PR?

DarshanDeshpande avatar Nov 27 '24 16:11 DarshanDeshpande

Awaiting Robert's response. @robertgshaw2-neuralmagic, could you please confirm if DCO is a hard requirement? While rebasing, I'm being prompted to sign off on commits that I didn't author. If required, I can create a new PR with the same changes.

garg-amit avatar Nov 27 '24 17:11 garg-amit

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @garg-amit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 30 '24 07:11 mergify[bot]

Awaiting Robert's response. @robertgshaw2-neuralmagic, could you please confirm if DCO is a hard requirement? While rebasing, I'm being prompted to sign off on commits that I didn't author. If required, I can create a new PR with the same changes.

DCO can be overridden. Don't worry about rebasing.

robertgshaw2-redhat avatar Dec 13 '24 22:12 robertgshaw2-redhat

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @garg-amit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Jan 31 '25 05:01 mergify[bot]