FLUX.1-dev inference fails with tensors on cpu and cuda:0

System Info

Ubuntu 24, vllm 0.5.5, vllm-flash-attn 2.6.1, torch 2.4.0, CUDA 12.4, transformers 4.44.2, diffusers 0.30.0

Running Xinference with Docker?

[ ] docker
[X] pip install
[ ] installation from source

Version info

Release: v0.15.2

The command used to start Xinference

xinference-local -H 0.0.0.0 -p 11435 --log-level debug

Reproduction

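The machine has four GPUs: GPUs 0 and 1 are 3090s, and GPUs 2 and 3 are 3060s (12 GB). When running FLUX.1-dev, selecting GPU 0 or 1 produces images, but very slowly. Selecting GPU 2 or 3 fails with: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)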
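One detail that can make the message confusing: it names cuda:0 even though GPU 2 or 3 was selected. When a process is restricted to a subset of GPUs through the CUDA_VISIBLE_DEVICES environment variable, CUDA renumbers the visible devices starting from 0, so physical GPU 2 appears inside that process as cuda:0. Whether Xinference pins the model this way is an assumption here, not something this report confirms, and the helper name below is hypothetical. The sketch only illustrates the renumbering; it does not explain why part of the model stayed on the CPU.

```python
def visible_device_map(cuda_visible_devices: str) -> dict[int, int]:
    """Map in-process CUDA indices to physical GPU indices.

    Only integer entries are handled; CUDA_VISIBLE_DEVICES may also
    contain GPU UUIDs, which this sketch does not parse.
    """
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    # Position in the list is the index the process sees (cuda:0, cuda:1, ...).
    return {logical: phys for logical, phys in enumerate(physical)}


# Hypothetical case: the process is limited to physical GPU 2 (a 3060 here).
print(visible_device_map("2"))  # {0: 2}
```

So a traceback mentioning cuda:0 does not by itself mean the model landed on the first 3090.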

Expected behavior

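1. FLUX.1-dev should be able to use several GPUs together to speed up image generation. Currently I have to add quantize_text_encoder=text_encoder_2, which is too slow; without it, GPU memory overflows.
2. With quantize_text_encoder=text_encoder_2, any single GPU with more than 9 GB of VRAM should be usable.
3. Ollama supports mixed GPU/CPU inference but supports few multimodal models, so I hope Xinference can be used alongside Ollama.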

Sep 27 '24 02:09 sticktoFE

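Even with quantize_text_encoder=text_encoder_2, it still needs roughly 11 GB of VRAM to run; anything below that is difficult.

We will support GGUF-quantized FLUX.1 in the future, but it will take some time.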
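To check the roughly 11 GB figure against a given machine, one option is to compare it with the free memory that nvidia-smi reports per GPU. The following is a minimal sketch, not part of Xinference: the function name is hypothetical, "about 11 GB" is read as 11 GiB (an assumption), and the sample numbers are invented for illustration rather than taken from the reporter's machine.

```python
import csv
import io

# Real nvidia-smi invocation that produces the CSV parsed below (values in MiB):
#   nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv,noheader,nounits
MIN_FREE_MIB = 11 * 1024  # assumption: "about 11 GB" interpreted as 11 GiB


def gpus_with_enough_memory(smi_csv, min_free_mib=MIN_FREE_MIB):
    """Return (index, name) for every GPU whose free memory meets the threshold."""
    fits = []
    for row in csv.reader(io.StringIO(smi_csv), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, _total, free = (field.strip() for field in row)
        if int(free) >= min_free_mib:
            fits.append((int(index), name))
    return fits


# Invented sample output; real free-memory values depend on what else is running.
SAMPLE = """\
0, NVIDIA GeForce RTX 3090, 24576, 24100
1, NVIDIA GeForce RTX 3090, 24576, 23950
2, NVIDIA GeForce RTX 3060, 12288, 11020
3, NVIDIA GeForce RTX 3060, 12288, 11900
"""

print(gpus_with_enough_memory(SAMPLE))
```

With these invented numbers, GPU 2 falls just under the threshold while GPU 3 clears it, which shows how a 12 GB card can sit on either side of the limit depending on how much memory is already in use.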

Sep 27 '24 03:09 qinxuye

This issue is stale because it has been open for 7 days with no activity.

Oct 04 '24 19:10 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

Oct 09 '24 19:10 github-actions[bot]

When will GGUF-quantized FLUX.1 be supported? We are also on a 3060 12G.

Dec 23 '24 14:12 geekidentity

Please open a new issue for the GGUF version.

Dec 23 '24 14:12 qinxuye

Here it is: https://github.com/xorbitsai/inference/issues/2698

Dec 24 '24 02:12 geekidentity