Support Multi-User for Gradio Web Demo
Feature request
How can a single Gradio web instance serve multiple users at the same time?
Motivation
One model instance serving many users' generation requests concurrently.
Your contribution
NA
My attempt at modifying gradio_web_demo.py as below doesn't work:
```diff
 generate_button.click(
     generate,
     inputs=[prompt, num_inference_steps, guidance_scale, num_frames],
     outputs=[video_output, download_video_button, download_gif_button],
+    concurrency_limit=3,
 )
 enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])

 if __name__ == "__main__":
+    demo.queue(max_size=20)
+    demo.launch(max_threads=10)
```
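For context, here is a minimal, self-contained sketch of how these three knobs interact in Gradio 4.x (`concurrency_limit` bounds parallel runs of one event, `queue(max_size=...)` bounds waiting requests, and `launch(max_threads=...)` caps the worker thread pool). The `generate` stub is illustrative; in the real demo it calls the CogVideoX pipeline:

```python
import gradio as gr

def generate(prompt):
    # Placeholder for the real inference function in gradio_web_demo.py.
    return f"video for: {prompt}"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    video_output = gr.Textbox(label="Result")
    generate_button = gr.Button("Generate")
    # Up to 3 generate events may run at once; further requests wait in the queue.
    generate_button.click(
        generate,
        inputs=[prompt],
        outputs=[video_output],
        concurrency_limit=3,
    )

if __name__ == "__main__":
    demo.queue(max_size=20)      # at most 20 requests waiting in the queue
    demo.launch(max_threads=10)  # size of the worker thread pool
```

Note that raising `concurrency_limit` only lets Gradio call the handler in parallel; it does not make the underlying pipeline object safe to share across threads.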
With this change, 3 users run inference in parallel as expected, but at the end each request fails with the error below:
```
  File "/home/jovyan/CogVideo/inference/gradio_web_demo.py", line 102, in infer
    video = pipe(
            ^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 720, in __call__
    video = self.decode_latents(latents)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 345, in decode_latents
    frames = self.vae.decode(latents).sample
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1184, in decode
    decoded = self._decode(z).sample
              ^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1142, in _decode
    return self.tiled_decode(z, return_dict=return_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1332, in tiled_decode
    tile = self.decoder(tile)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 851, in forward
    hidden_states = self.conv_in(sample)
                    ^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 134, in forward
    inputs = self.fake_context_parallel_forward(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/baize-runtime-env/peter-vllm-python3-17/conda/envs/peter-vllm-python3-17/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 126, in fake_context_parallel_forward
    inputs = torch.cat(cached_inputs + [inputs], dim=2)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 18 but got size 45 for tensor number 1 in the list.
```
This issue seems like it shouldn't exist: this demo is the same one deployed on Hugging Face, which serves far more than 2 users, and similar errors have not occurred there. Normally, users should queue up rather than access the GPU simultaneously; even with just one user, GPU utilization can reach nearly 100%.
I haven't tested adding another task before the current one completes. This aspect of the demo probably won't be adjusted (our resources are limited), so this might be handled at a lower priority.
Currently, the Hugging Face Space for CogVideo can only serve one request at a time; subsequent requests wait in the queue. But I think CogVideo could support many users running in parallel: from what I observed, each additional concurrent user only adds about 2 GB of HBM, so a 48 GB GPU could serve many concurrent users/queries. Also, the progress bar reaches 100% before the failure, so I suspect the problem is only in the final step where the latent tensor is decoded to video.