Running multiple concurrent threads (each with its own model+context) throws exceptions
I tried running two threads, each allocating its own model and context, but calling `context.runFull()` from both threads at the same time throws exceptions.
Is this tested/supported? My GPU has more than enough VRAM and shader cores to run several instances concurrently. This would be highly useful, since it would increase throughput anywhere from 2x to 20x depending on the GPU.
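For reference, a minimal sketch of the setup being described, with one model and one context per thread. Only `runFull()` is named in this thread; `Library.loadModel`, the `Context` type, and `createContext()` are assumptions based on the WhisperNet-style C# API, so check the actual signatures:

```csharp
using System.Threading.Tasks;
using Whisper;

Parallel.For( 0, 2, i =>
{
    // Each thread allocates its own model and context (names assumed)
    iModel model = Library.loadModel( "ggml-small.bin" );
    using Context context = model.createContext();
    // ... call context.runFull( ... ) here with this thread's audio;
    // this is the concurrent call that was throwing exceptions
} );
```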
@philipag Should be fixed in version 1.10.
It also supports one more relevant feature: pass `eGpuModelFlags.Cloneable` when loading the model, call the `iModel.clone()` method, then create two contexts from these two models. The tensors in VRAM are shared between the devices, not duplicated.
Admittedly, I never tested that use case; there might be bugs there.
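A sketch of that clone flow might look like the following. `eGpuModelFlags.Cloneable`, `iModel.clone()`, and creating a context per model are taken from the comment above; the `Library.loadModel` overload accepting flags is an assumption:

```csharp
using Whisper;

// Load the model once with the Cloneable flag, then clone it;
// the VRAM tensors are shared between the two models, not copied
iModel model = Library.loadModel( "ggml-small.bin", eGpuModelFlags.Cloneable );
iModel clone = model.clone();

// One context per model, each usable from its own thread
using Context ctx1 = model.createContext();
using Context ctx2 = clone.createContext();
```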
@Const-me Thanks for the fast update. Things now run reliably with 2 or more threads. In some cases this seems to be shader-bound (with ggml-small, a GTX 970 maxes out at 100% GPU usage with 2 threads). In other cases it appears to be memory-bandwidth bound (with ggml-tiny, a GTX 970 reaches only about 50% GPU usage with 3 threads, and adding more threads starts to decrease throughput). In all cases, GPU memory capacity is never fully utilized.
ggml-tiny with 3 threads achieves about 32x realtime throughput, and ggml-small with 2 threads achieves 12x.
In 3 weeks I will have an RTX 4090, which has >4x the memory bandwidth, so I will see whether the ggml-tiny case really does scale about 4x compared to the GTX 970 (perhaps even more due to fp16/fp32 differences?).
@philipag About the performance: you could try the advanced GPU flags. The defaults are based only on the vendor ID; when I detect an AMD GPU, I use different defaults.
That's not always optimal, because I have only tested with a few GPUs. See the comments in issue https://github.com/Const-me/Whisper/issues/16
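Overriding the defaults might look something like this. The `Wave64` and `UseReshapedMatMul` flag names come from this thread; treating `eGpuModelFlags` as a combinable flags enum and the `loadModel` overload shown are assumptions:

```csharp
using Whisper;

// Combine advanced GPU flags to override the vendor-based defaults
const eGpuModelFlags flags = eGpuModelFlags.Wave64 | eGpuModelFlags.UseReshapedMatMul;
iModel model = Library.loadModel( "ggml-tiny.bin", flags );
```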
@Const-me It seems your defaults for my Nvidia GTX 970 are the best. `Wave32`/`Wave64` makes little difference, and `UseReshapedMatMul` slows things down by about 25%.