
[Misc] `compressed-tensors` code reuse

Open kylesayrs opened this pull request 1 year ago • 7 comments

The reused classes are

  1. CompressionFormat
  2. QuantizationArgs
  3. QuantizationStrategy
  4. QuantizationType
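For context, here is a rough sketch of the shape of the four reused definitions. This is illustrative only: the member values and fields are assumptions for this sketch, and the real `QuantizationArgs` in `compressed-tensors` is a validated model rather than a plain dataclass.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Sketch of the reused enums; member values are assumptions for illustration.
class CompressionFormat(str, Enum):
    DENSE = "dense"
    INT_QUANTIZED = "int-quantized"

class QuantizationType(str, Enum):
    INT = "int"
    FLOAT = "float"

class QuantizationStrategy(str, Enum):
    TENSOR = "tensor"
    CHANNEL = "channel"
    GROUP = "group"
    TOKEN = "token"

# Stand-in for the QuantizationArgs model: the knobs that describe
# how a set of weights or activations is quantized.
@dataclass
class QuantizationArgs:
    num_bits: int = 8
    type: QuantizationType = QuantizationType.INT
    strategy: QuantizationStrategy = QuantizationStrategy.TENSOR
    symmetric: bool = True
    group_size: Optional[int] = None
```

Reusing these shared definitions from the `compressed-tensors` package, rather than redefining them in vLLM, keeps the two codebases from drifting apart as new formats and strategies are added.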

kylesayrs avatar Aug 07 '24 17:08 kylesayrs

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment `/ready` on the PR
  • Add the `ready` label to the PR
  • Enable auto-merge

🚀

github-actions[bot] avatar Aug 07 '24 17:08 github-actions[bot]

/ready

dsikka avatar Aug 08 '24 18:08 dsikka

@kylesayrs you're missing https://github.com/vllm-project/vllm/blob/5923532e15b12832febe91ef5e9a4cef2786cefc/requirements-test.txt#L20

dsikka avatar Aug 08 '24 21:08 dsikka

Current state looks good so far.

The biggest piece of feedback is that we are still rewriting the logic associated with parsing the config. Specifically, the `get_scheme` function in `compressed_tensors.py` duplicates this code:

  • https://github.com/vllm-project/vllm/pull/7277/files#diff-db60d9d797517c6bdf2ce051b532fe6fa6bceab04516c6d8c2d80a2ae803abe9L263

It will be tricky to fix this (because the vLLM state_dict is not a 1:1 map with the transformers state_dict), so feel free to reach out if you need any pointers.
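To make the duplication concrete, here is a minimal sketch of the config-parsing shape being discussed. Every name here is a hypothetical stand-in, not vLLM's actual `get_scheme` implementation: the idea is simply matching a layer name against the `targets` of each quantization config group.

```python
from typing import Optional

def get_scheme(config_groups: dict, layer_name: str) -> Optional[dict]:
    """Hypothetical stand-in for the duplicated parsing logic:
    return the quantization group whose targets match this layer,
    or None if the layer is not quantized."""
    for group in config_groups.values():
        if any(target in layer_name for target in group.get("targets", [])):
            return group
    return None

# Example config: quantize all Linear layers to 8-bit weights.
config_groups = {
    "group_0": {"targets": ["Linear"], "weights": {"num_bits": 8}},
}
```

The tricky part mentioned above is that this matching happens against vLLM's fused module names (e.g. merged QKV projections), which do not map one-to-one onto the transformers `state_dict` names the config was written against.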

robertgshaw2-redhat avatar Aug 09 '24 14:08 robertgshaw2-redhat

@robertgshaw2-neuralmagic I think updating the `get_scheme` function is beyond the scope of this PR. I'd like to first land using `compressed-tensors` without any dependency conflicts. Refactoring `get_scheme` should be a follow-up.

dsikka avatar Aug 09 '24 15:08 dsikka

These test failures seem unrelated to this PR? A few appear to be CUDA errors, and one complains about bad LLM metrics measurements.

kylesayrs avatar Aug 09 '24 16:08 kylesayrs

> @robertgshaw2-neuralmagic I think updating the `get_scheme` function is beyond the scope of this PR. I'd like to first land using `compressed-tensors` without any dependency conflicts. Refactoring `get_scheme` should be a follow-up.

Sounds good.

@kylesayrs I'm just running this by Simon, but we should be good to go.

robertgshaw2-redhat avatar Aug 09 '24 16:08 robertgshaw2-redhat