
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue

Open tdoublep opened this pull request 1 year ago • 10 comments

Fixes #6103

We have been using this fix via our fork (see here) for a while and it seems stable.

~~Note, this will only resolve the problem if you are using vLLM from the Docker image.~~ Maybe a better approach would be to bundle the custom cache manager code inside the vllm package; that way it will get shipped via pip install too, and the user could still set an env variable to enable it.

Update: I've now implemented it by including the custom cache manager inside vLLM and setting the necessary env variable via code.
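
For readers following along, here is a minimal sketch of the approach (not the exact vLLM code). It assumes Triton exposes `FileCacheManager` and `default_cache_dir` under `triton.runtime.cache`, and that the `TRITON_CACHE_MANAGER` env variable is read as a `module.path:ClassName` string; the `my_pkg.triton_cache` module path and the helper names below are placeholders.

```python
import os

from triton.runtime.cache import FileCacheManager, default_cache_dir


class PerProcessCacheManager(FileCacheManager):
    """Key the on-disk Triton compile cache by PID so worker processes
    started via multiprocessing do not race on the same cache files."""

    def __init__(self, key, **kwargs):
        # Deliberately not calling super().__init__(): set up the same
        # attributes, but point them at a per-process directory.
        self.key = key
        cache_root = os.environ.get("TRITON_CACHE_DIR", default_cache_dir())
        self.cache_dir = os.path.join(f"{cache_root}_{os.getpid()}", self.key)
        self.lock_path = os.path.join(self.cache_dir, "lock")
        os.makedirs(self.cache_dir, exist_ok=True)


def maybe_set_triton_cache_manager() -> None:
    """Opt in to the per-process cache manager unless the user already set one."""
    if "TRITON_CACHE_MANAGER" not in os.environ:
        os.environ["TRITON_CACHE_MANAGER"] = (
            "my_pkg.triton_cache:PerProcessCacheManager")  # placeholder path
```

Presumably the variable has to be set before the first Triton kernel is JIT-compiled in a worker process, which is why it is set from code rather than left as a manual step for users.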

cc @jeejeelee

tdoublep avatar Jul 04 '24 13:07 tdoublep

Thanks

IMHO, this issue should be addressed by bundling the custom cache manager code inside the vllm package.

cc @simon-mo @youkaichao @Yard1

jeejeelee avatar Jul 04 '24 13:07 jeejeelee

Thanks @tdoublep, I had mentioned this to @youkaichao previously but kept forgetting to open a PR.

Not immediately obvious why this seems to only affect the mp case and not ray.

I agree that it would be better for this to be incorporated into the library if possible.

I wonder if we could open a PR or issue in the Triton repo for this (if one doesn't already exist).

njhill avatar Jul 04 '24 14:07 njhill

@njhill @jeejeelee I have re-implemented it as part of the vllm library.

One thing I'm not sure about is whether setting the env variable from the `fused_moe` code is sufficient, or whether there are other parts of the code where this fix would be needed. Maybe it's OK for now.

tdoublep avatar Jul 04 '24 14:07 tdoublep

After reading the conversation here, it sounds like we would also need to set this env variable accordingly when using the Triton punica kernel (e.g., once we merge this PR).

tdoublep avatar Jul 04 '24 15:07 tdoublep

> After reading the conversation here, it sounds like we would also need to set this env variable accordingly when using the Triton punica kernel (e.g., once we merge this PR).

Even if we don't consider #5036, `prefix_prefill` and `triton_flash_attention` still need this fix.

jeejeelee avatar Jul 04 '24 15:07 jeejeelee

> Even if we don't consider https://github.com/vllm-project/vllm/pull/5036, `prefix_prefill` and `triton_flash_attention` still need this fix.

@jeejeelee OK, in that case I guess it makes sense to call `maybe_set_triton_cache_manager` in a single place rather than in each individual place where we use Triton. Perhaps we can do it when we detect tp>1 and multiprocessing being used?

tdoublep avatar Jul 04 '24 17:07 tdoublep

@njhill @jeejeelee I've moved the call to `maybe_set_triton_cache_manager` into the `MultiprocessingGPUExecutor`. I guess this is safer and should cover all the cases we need.
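
As an illustration of that gating, here is a hedged sketch (the function name, the import path, and the use of `multiprocessing` are stand-ins for whatever the real executor does):

```python
from multiprocessing import get_context

# Placeholder import path; `maybe_set_triton_cache_manager` is the helper
# sketched earlier in this thread.
from my_pkg.triton_cache import maybe_set_triton_cache_manager


def init_workers(world_size: int) -> None:
    """Stand-in for the executor's setup: when more than one process will
    JIT-compile Triton kernels, switch to the per-process cache first."""
    if world_size > 1:
        maybe_set_triton_cache_manager()
    ctx = get_context("spawn")
    # ... create the (world_size - 1) worker processes with ctx.Process(...) ...
```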

tdoublep avatar Jul 04 '24 19:07 tdoublep

CI test failures look like network blips (Read timed out).

tdoublep avatar Jul 05 '24 11:07 tdoublep

CI failure looks unrelated:

```
FAILED distributed/test_multimodal_broadcast.py::test_models[5-128-half-2] - huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on
```

tdoublep avatar Jul 05 '24 19:07 tdoublep

> Not immediately obvious why this seems to only affect the mp case and not ray.

I'm wondering this too. cc Anyscale folks @cadedaniel @Yard1 for visibility.

youkaichao avatar Jul 06 '24 01:07 youkaichao

All comments have been addressed. Is there anything else you would like to see? @comaniac @njhill @jeejeelee @simon-mo

I think it would be good to get this one in since there are quite a few people struggling with this issue.

tdoublep avatar Jul 15 '24 09:07 tdoublep

Merging to unblock the release.

simon-mo avatar Jul 15 '24 17:07 simon-mo