Multiple GPUs of Same Name

Open · adamreed90 opened this issue Mar 28 '23 · 10 comments

Using `iModel model = Library.loadModel(cla.model);`, I am doing some testing with 2 x RTX 3070s, which show up with the same adapter name. I think it would be helpful to support an integer-based index for selecting the GPU in cases like this. 👍
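
For reference, a minimal standalone sketch of why selecting by name is ambiguous here: enumerating adapters with plain DXGI (standard API calls, not this project's code) prints the identical description string for both 3070s, so only the enumeration index can tell the two cards apart.

```cpp
#include <atlbase.h>	// CComPtr
#include <dxgi.h>
#include <cstdio>
#pragma comment( lib, "dxgi.lib" )

int main()
{
	CComPtr<IDXGIFactory1> factory;
	if( FAILED( CreateDXGIFactory1( IID_PPV_ARGS( &factory ) ) ) )
		return 1;

	for( UINT i = 0; true; i++ )
	{
		CComPtr<IDXGIAdapter1> adapter;
		if( factory->EnumAdapters1( i, &adapter ) == DXGI_ERROR_NOT_FOUND )
			break;	// Enumerated all adapters

		DXGI_ADAPTER_DESC1 desc;
		adapter->GetDesc1( &desc );

		// With 2 x RTX 3070 both lines print the same description string,
		// so only the index `i` distinguishes the two cards.
		wprintf( L"%u: %s\n", i, desc.Description );
	}
	return 0;
}
```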

adamreed90 commented Mar 28 '23

I modified selectAdapter to take either an index or an adapter name, and it worked out great!

(Full disclosure: I used ChatGPT for this ... :( )

listGPUs.cpp:

```cpp
CComPtr<IDXGIAdapter1> selectAdapter( const std::wstring& requestedName )
{
	if( requestedName.empty() )
		return nullptr;

	CComPtr<IDXGIFactory1> dxgi;
	HRESULT hr = createFactory( dxgi );
	if( FAILED( hr ) )
	{
		logWarningHr( hr, u8"CreateDXGIFactory1 failed" );
		return nullptr;
	}

	std::wstring name;
	UINT index = UINT_MAX;

	// Check whether the requested name is a number, i.e. an adapter index
	try
	{
		index = std::stoi( requestedName );
	}
	catch( const std::invalid_argument& )
	{
		// The requested name is not a number; proceed with the name lookup
	}
	catch( const std::out_of_range& )
	{
		// The number doesn't fit in an int; proceed with the name lookup
	}

	for( UINT i = 0; true; i++ )
	{
		CComPtr<IDXGIAdapter1> adapter;
		hr = dxgi->EnumAdapters1( i, &adapter );
		if( hr == DXGI_ERROR_NOT_FOUND )
		{
			// Ran out of adapters without finding a match
			logWarning16( L"Requested GPU not found: \"%s\"", requestedName.c_str() );
			return nullptr;
		}
		if( FAILED( hr ) )
		{
			logErrorHr( hr, u8"IDXGIFactory1.EnumAdapters1 failed" );
			return nullptr;
		}

		DXGI_ADAPTER_DESC1 desc;
		adapter->GetDesc1( &desc );
		setName( name, desc );

		// Match either the zero-based enumeration index, or the adapter name
		if( index != UINT_MAX && index == i )
			return adapter;
		if( name == requestedName )
			return adapter;
	}
}
```
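
With this change the same string argument accepts either form. For illustration, the two call styles side by side (the adapter name below is just an example of what DXGI reports for these cards):

```cpp
// Select by zero-based adapter index: useful when both cards report the same name
CComPtr<IDXGIAdapter1> second = selectAdapter( L"1" );

// Or select by the full adapter name, as before
CComPtr<IDXGIAdapter1> byName = selectAdapter( L"NVIDIA GeForce RTX 3070" );
```

One caveat of parsing the index out of the same string: an adapter whose name happens to be a bare number can no longer be selected by name, though that is unlikely in practice.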

adamreed90 commented Mar 28 '23

Using these changes I was able to fit 3 ggml-medium.en.bin models into 2 x RTX 3070s, and handle 6 simultaneous transcriptions with an ASP.NET Core API I built.

Thank you so much @Const-me for your work on this project, it's quite impressive!

adamreed90 commented Mar 30 '23

I implemented something similar, but with 4 x RTX 3070s. However, it wasn't stable at all under the same OS: even though the instances are assigned to different Direct3D devices, they somehow share some resources anyway. If you'd like to squeeze more performance out of your GPUs, I recommend running it under 2 virtual machines. We're using 4 and can run the GPUs at 100% load, with minor throttling due to temperature.

maxaki commented Mar 31 '23

I noticed something strangely similar: the first GPU would use 5.9 GB/8 GB of VRAM and the second only 2.9 GB/8 GB, yet both would handle 3 simultaneous transcriptions.

Virtualization unfortunately isn't an option with the intended hardware setup I have available.

adamreed90 commented Mar 31 '23

Shouldn't be an issue with Hyper-V PCI passthrough; install the NVIDIA drivers natively on the VM. The GeForce series isn't officially supported for this by Microsoft and NVIDIA, but there are easy workarounds.

maxaki commented Mar 31 '23

Unfortunately I'm using a special-purpose SBC, not a standard PC, with very limited resources and capabilities; it wouldn't handle multiple Windows VMs. My end goal is to get this working in containers on Linux.

adamreed90 commented Mar 31 '23

@adamreed90 I like the idea of an integer index in that string; it's simple and it works. Will be fixed in the next version. In the meantime, update from master and build.

One more thing: if you create multiple contexts to run on the same GPU, try the clone() workflow, https://github.com/Const-me/Whisper/issues/49#issuecomment-1474915688. It should help with VRAM use, because it causes the model's tensors to be shared instead of copied.
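
A rough sketch of what that clone() workflow might look like from the C++ side. Everything below is an assumption based on this thread, not the library's verified API: `loadModel`, `createContext`, `clone`, and the `check` error helper are all placeholders, so see the linked comment in #49 for the real calls.

```cpp
// Sketch only; identifiers are placeholders, see issue #49 for the real API.

// 1. Load the model once, so its tensors are uploaded to VRAM a single time.
CComPtr<iModel> model;
check( loadModel( L"ggml-medium.en.bin", &model ) );	// hypothetical loader

// 2. Create the first context the usual way.
CComPtr<iContext> ctx1;
check( model->createContext( &ctx1 ) );	// hypothetical factory method

// 3. Clone that context for each extra worker: per this thread, the clone
//    shares the model's tensors with ctx1 instead of copying them.
CComPtr<iContext> ctx2;
check( ctx1->clone( &ctx2 ) );	// hypothetical clone() per the linked workflow
```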

Const-me commented Apr 02 '23

I did some tests with that on an RTX 3070 (8 GB VRAM), and the overall relative speed stayed the same whether I ran 1 or 2 instances: per-instance performance simply got cut in half when cloning. I've tried most things; currently the only speed increase I can find is combining multiple audio buffers into one, rather than repeating runFull for each file.
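
A minimal sketch of that batching idea, limited to the buffer concatenation itself; feeding the combined buffer to a single runFull-style call, and mapping the output segments back to the individual files, is left out.

```cpp
#include <vector>

// Concatenate several mono PCM clips (all at the same sample rate, e.g. 16 kHz)
// into one contiguous buffer, so one transcription call processes them back to back.
std::vector<float> combineClips( const std::vector<std::vector<float>>& clips )
{
	// Reserve the exact total size up front to avoid repeated reallocations
	size_t total = 0;
	for( const auto& clip : clips )
		total += clip.size();

	std::vector<float> combined;
	combined.reserve( total );
	for( const auto& clip : clips )
		combined.insert( combined.end(), clip.begin(), clip.end() );
	return combined;
}
```

The trade-off is that timestamps in the result are relative to the combined buffer, so you have to record each clip's starting offset to split the segments back per file.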

maxaki commented Apr 03 '23

I've been having issues getting this to work via .NET; I'll try a bit more, then post back with specific errors.

adamreed90 commented Apr 05 '23

@maxaki Did you manage to get any improved performance out of concurrent transcriptions?

adamreed90 commented May 04 '23