web-llm

Performance on NVIDIA GPU (discrete) seems to be much worse than AMD (integrated) GPU - is that expected?

Open armsp opened this issue 2 years ago • 2 comments

I have an integrated AMD GPU (512 MB dedicated memory, 11.6 GB shared memory) and a discrete NVIDIA GPU (6 GB dedicated memory, 11.6 GB shared). The results were quite unexpected (screenshot attached).

When using AMD, mostly shared memory was used (8.3/11.6 GB), but on NVIDIA it was the dedicated memory (5.7/6 GB). I expected the opposite. I ran Chrome on the NVIDIA GPU and Canary on the integrated AMD. (It did seem to me that different models were loaded, but I don't have a screenshot for that.)

armsp avatar May 11 '23 07:05 armsp

I think the models were the same.

AMD

[System Initalize] Initialize GPU device: WebGPU - amd
[System Initalize] Fetching param cache[81/163]: 2006MB fetched. 49% completed, 22 secs elapsed. It can take a while when we first visit this page to populate the cache. Later refreshes will become faster.
[System Initalize] Loading GPU shader modules[50/54]: 92% completed, 4 secs elapsed.

NVIDIA

[System Initalize] Initialize GPU device: WebGPU - NVIDIA GeForce RTX 3060 Laptop GPU
[System Initalize] Fetching param cache[55/163]: 1372MB fetched. 34% completed, 10 secs elapsed. It can take a while when we first visit this page to populate the cache. Later refreshes will become faster.
[System Initalize] Loading GPU shader modules[46/54]: 85% completed, 3 secs elapsed.

Performance still seems to be much better with the integrated GPU than with the NVIDIA GPU.

armsp avatar May 11 '23 07:05 armsp

You're being bottlenecked by the 6 GB of dedicated memory on the NVIDIA laptop GPU, which constantly swaps data in and out to run inference on the model. The integrated AMD doesn't have to swap, because all of its memory is shared (its "dedicated" GPU memory is really just reserved shared memory).
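A back-of-envelope check supports this. The NVIDIA log above shows 2006 MB fetched at 49% of the parameter cache, which pins down the total weight size; the runtime overhead figure below is an assumption for illustration, not a measured number:

```javascript
// Estimate total parameter-cache size from the log line:
// "2006MB fetched. 49% completed" (but note the log quoted is the AMD run;
// both runs appear to load the same 163-shard cache).
const fetchedMB = 2006;
const fetchedFraction = 0.49;
const paramsMB = fetchedMB / fetchedFraction; // ≈ 4094 MB of weights alone

// Rough allowance for KV cache, activations, and WebGPU buffers (assumption):
const runtimeOverheadMB = 1500;
const totalMB = paramsMB + runtimeOverheadMB; // ≈ 5.6 GB

console.log(Math.round(paramsMB), Math.round(totalMB));
```

That lands close to the 5.7/6 GB dedicated-memory usage reported above, so any additional allocation has to spill into shared memory over PCIe, which is much slower than on-card VRAM.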

Foxlum avatar May 15 '23 23:05 Foxlum

@Foxlum I see. Is there any way to make this faster while using the NVIDIA GPU, or will this always be a problem as long as the VRAM is smaller than the model requires?

armsp avatar May 21 '23 11:05 armsp

Are you sure it's even using the NVIDIA card? On my laptop, the WebLLM demo loads the model into the Intel UHD integrated GPU's shared memory and processes it there very slowly, instead of using my discrete 3080.

RickieChang avatar May 26 '23 22:05 RickieChang

@RickieChang Yes. If you look at the screenshots I shared above, you can see that it uses the NVIDIA card in the left image.

armsp avatar May 27 '23 08:05 armsp
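For anyone else checking which adapter the browser actually handed to WebGPU: `navigator.gpu` only exists in a WebGPU-capable browser, so this is a sketch with the browser calls shown as comments; the `describeAdapter` helper and its labels are hypothetical, and `requestAdapterInfo()` is the adapter-info API as it existed around this time (later spec revisions changed it):

```javascript
// Classify an adapter from the vendor/device strings WebGPU reports.
// (Hypothetical helper for illustration — labels are our own.)
function describeAdapter(info) {
  const s = `${info.vendor || ''} ${info.device || ''} ${info.description || ''}`.toLowerCase();
  if (s.includes('nvidia')) return 'NVIDIA (discrete)';
  if (s.includes('amd')) return 'AMD';
  if (s.includes('intel')) return 'Intel (integrated)';
  return 'unknown';
}

// In a WebGPU-capable browser you would run:
// const adapter = await navigator.gpu.requestAdapter({
//   powerPreference: 'high-performance', // hint the browser toward the discrete GPU
// });
// const info = await adapter.requestAdapterInfo();
// console.log(describeAdapter(info));
```

The `powerPreference: 'high-performance'` hint is part of the WebGPU spec, but browsers are free to ignore it, which is one reason a laptop can end up running on the integrated GPU even with a discrete card present.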

This is likely due to a VRAM issue. The latest update comes with a smaller model that should be faster.

tqchen avatar Jun 16 '23 15:06 tqchen
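To see why a smaller (more aggressively quantized) model helps, here is a rough weight-memory estimate for a 7B-parameter model at different bit widths; the specific bit widths are illustrative assumptions, not a statement of which variants web-llm shipped:

```javascript
// Weight memory for a model with `params` parameters stored at `bits` bits each.
const params = 7e9; // assumption: a 7B-parameter model
const weightMB = (bits) => (params * bits) / 8 / 2 ** 20;

console.log(Math.round(weightMB(4))); // 4-bit weights ≈ 3338 MB
console.log(Math.round(weightMB(3))); // 3-bit weights ≈ 2503 MB
```

Dropping from 4-bit to 3-bit weights saves roughly 800 MB here, which on a 6 GB card can be the difference between fitting entirely in dedicated VRAM and spilling into slow shared memory.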