Support Multi-User for Gradio Web Demo
Feature request
How can a single Gradio web instance serve multiple users at the same time?
Motivation
One model instance serving many users' generation requests concurrently.
Your contribution
NA
My attempt at modifying gradio_web_demo.py as below doesn't work:
```diff
 generate_button.click(
     generate,
     inputs=[prompt, num_inference_steps, guidance_scale, num_frames],
     outputs=[video_output, download_video_button, download_gif_button],
+    concurrency_limit=3,
 )
 enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])

 if __name__ == "__main__":
+    demo.queue(max_size=20)
+    demo.launch(max_threads=10)
```
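For context, here is a minimal, self-contained sketch of how these three knobs interact in Gradio 4.x (`concurrency_limit` bounds parallel runs of one event, `queue(max_size=...)` bounds waiting requests, and `launch(max_threads=...)` caps the worker thread pool). The `generate` stub is illustrative; in the real demo it calls the CogVideoX pipeline:

```python
import gradio as gr

def generate(prompt):
    # Placeholder for the real inference function in gradio_web_demo.py.
    return f"video for: {prompt}"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    video_output = gr.Textbox(label="Result")
    generate_button = gr.Button("Generate")
    # Up to 3 generate events may run at once; further requests wait in the queue.
    generate_button.click(
        generate,
        inputs=[prompt],
        outputs=[video_output],
        concurrency_limit=3,
    )

if __name__ == "__main__":
    demo.queue(max_size=20)      # at most 20 requests waiting in the queue
    demo.launch(max_threads=10)  # size of the worker thread pool
```

Note that raising `concurrency_limit` only lets Gradio call the handler in parallel; it does not make the underlying pipeline object safe to share across threads.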
With this change, 3 users run inference in parallel as expected, but at the end each request fails with the error below:
```
  File "/home/jovyan/CogVideo/inference/gradio_web_demo.py", line 102, in infer
    video = pipe(
            ^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 720, in __call__
    video = self.decode_latents(latents)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 345, in decode_latents
    frames = self.vae.decode(latents).sample
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1184, in decode
    decoded = self._decode(z).sample
              ^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1142, in _decode
    return self.tiled_decode(z, return_dict=return_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1332, in tiled_decode
    tile = self.decoder(tile)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 851, in forward
    hidden_states = self.conv_in(sample)
                    ^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 134, in forward
    inputs = self.fake_context_parallel_forward(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 126, in fake_context_parallel_forward
    inputs = torch.cat(cached_inputs + [inputs], dim=2)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 18 but got size 45 for tensor number 1 in the list.
```
This issue seems like it shouldn't exist: this demo is the same one deployed on Hugging Face, which serves far more than 2 users, and similar errors have not occurred there. Normally, users should queue up rather than access the GPU simultaneously; even with just one user, GPU utilization can reach nearly 100%.
I haven't tested adding another task before the current one completes. This aspect of the demo probably won't be adjusted (our resources are limited), so this might be handled at a lower priority.
Currently, the Hugging Face Space for CogVideo can only serve one request at a time; subsequent requests wait in the queue. But I think CogVideo could support many users running in parallel: from what I observed, each additional concurrent user only adds about 2 GB of HBM, so a 48 GB GPU could serve many concurrent users/queries. Also, the progress bar reaches 100% before the failure, so I suspect the problem is only in the final step where the latent tensor is decoded to video.