
Speed when splitting a model across 2 GPUs

Open CyberTimon opened this issue 2 years ago • 3 comments

Hello guys! I want to buy 2x 3060 12GB, as I can get them for only 500.-. I then want to run llama 30b 4bit on them, which uses about 20GB of GPU memory, so this should work with --auto-devices (since I'd have 24GB of VRAM in total). One person on a Discord told me this would work easily, but someone else told me the speed would be very bad. Has anyone already tried this or can confirm the performance impact? I hope to get around 5-6 t/s. Thanks in advance!

CyberTimon avatar Apr 15 '23 19:04 CyberTimon
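
For reference, --auto-devices roughly amounts to letting the loader spread the model's layers across whatever GPUs are visible. Below is a minimal sketch of the same idea outside the webui, using Hugging Face transformers + accelerate; the model path and per-GPU memory caps are illustrative assumptions, and bitsandbytes 4-bit is used as a stand-in for the GPTQ 4-bit checkpoint discussed in this thread:

```python
# Minimal sketch (not the webui's exact code path) of splitting a large model
# across two 12 GB cards with transformers + accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/llama-30b"  # hypothetical local checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # stand-in for GPTQ 4-bit
    device_map="auto",                    # let accelerate decide the layer split
    max_memory={0: "11GiB", 1: "11GiB"},  # keep some headroom under 12 GB per card
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")  # the first layers live on GPU 0
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

After loading, printing model.hf_device_map shows which layers ended up on which card.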

Well, for me (with a 4090) 30b 4bit works, yes. But once the context reaches 1000+ tokens, 24 GB of VRAM doesn't seem to be enough. I start to get problems (responses with 0 tokens). So don't expect miracles even if you do get the model running.

mudakisa avatar Apr 15 '23 23:04 mudakisa
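
A quick way to see how close you are to the limit described above is to check the free VRAM on each card as the context grows (assumes PyTorch with CUDA available):

```python
import torch

# Print free vs. total VRAM per GPU (torch.cuda.mem_get_info returns bytes).
# Running this between generations shows how the growing KV cache eats into
# the headroom as the context gets longer.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```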

It will be almost as fast as (or slightly slower than) a single 3060 would be if it had 24 GB of VRAM; if that's fine with you, then yes. You won't get the combined speed of two GPUs.

sgsdxzy avatar Apr 16 '23 05:04 sgsdxzy
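
The reason there is no combined speed is that splitting by layers works pipeline-style: each token has to pass through GPU 0's layers before GPU 1's, so the two cards take turns instead of working in parallel. A toy sketch of that pattern (plain PyTorch, not webui code; needs two CUDA devices):

```python
import torch
import torch.nn as nn

# Two halves of a "model", one half per GPU (layer count and sizes are arbitrary).
half_a = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).to("cuda:0")
half_b = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).to("cuda:1")

x = torch.randn(1, 4096, device="cuda:0")
hidden = half_a(x)            # GPU 1 sits idle while GPU 0 computes
hidden = hidden.to("cuda:1")  # only the small activation tensor crosses over PCIe
out = half_b(hidden)          # GPU 0 sits idle while GPU 1 computes
```

Only activations move between the cards per token, not weights, which is part of why the split works at all over plain PCIe.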

Thank you for the responses, that answers my question. It's not a big problem if it's just a bit slower; I was only worried because the person on the Discord said the performance is horrible without NVLink. @mudakisa Do you also have displays connected to your 4090 that use VRAM? I ask because I will run the 3060s in my headless server.

CyberTimon avatar Apr 16 '23 09:04 CyberTimon

Yes, the card works with 2 monitors connected to it.

mudakisa avatar Apr 16 '23 21:04 mudakisa