text-generation-webui
Speed when splitting a model across 2 GPUs
Hello guys! I want to buy 2x 3060 12 GB, since I can get them for only 500.-. I then want to run LLaMA 30B 4-bit, which uses about 20 GB of GPU memory, so this should work with --auto-devices (since I'd have 24 GB of VRAM in total). One person on a Discord told me this would work easily, but someone else said the speed would be very, very bad. Has anyone already tried this, or can anyone confirm the performance impact? I'm hoping for around 5-6 t/s. Thanks in advance.
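For context, here is a minimal Python sketch (using Hugging Face transformers/Accelerate, not the webui's own loading code) of what splitting a model across two GPUs with per-GPU memory caps looks like; this is roughly the mechanism behind --auto-devices / --gpu-memory. The model path, the 11 GiB caps, and the load_in_4bit flag are illustrative assumptions only, and the webui's GPTQ 4-bit loading works somewhat differently.

```python
# Minimal sketch: split a causal LM across two 12 GB GPUs with memory caps.
# Model path, memory limits, and load_in_4bit are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-30b"  # hypothetical; point at your own weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",                    # let Accelerate place layers on cuda:0 and cuda:1
    max_memory={0: "11GiB", 1: "11GiB"},  # leave ~1 GiB headroom on each 12 GB 3060
    load_in_4bit=True,                    # bitsandbytes 4-bit; GPTQ loading differs
)

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the layers are placed sequentially across the two cards, generation runs on one GPU at a time; the split buys you capacity, not parallel speed.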
Well, for me (with a 4090), 30B 4-bit works, yes. But once the context reaches 1000+ tokens, 24 GB of VRAM doesn't seem to be enough, and I start getting problems (responses with 0 tokens). So don't expect a miracle even if you get the model running.
It will be almost as fast as (or slightly slower than) a single 3060 with 24 GB of VRAM; if that's fine with you, then yes. You won't get the combined speed of two GPUs.
Thank you for the responses, that answers my question. It's not a big problem if it's just a bit slower; I was only worried because the person on the Discord said the performance is horrible without NVLink. @mudakisa Do you also have displays connected to your 4090 that use VRAM? Because I will run the 3060s in a headless server.
Yes, the card works with 2 monitors connected to it.