llama.cpp Vulkan Optimizations and Fixes

I have implemented a number of Vulkan optimizations and fixes:

Implement a REPEAT operator shader to fix the low performance of the copy-based Vulkan implementation
Use GLSL FMA instruction where possible
Add GGML_VULKAN_PERF option to get approximate performance data about a running model
Rework and fix Vulkan descriptor set handling, which improves performance in my tests on AMD RADV
Fix validation error on float32 concat f16 shader
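
For readers unfamiliar with the first two items, here is a minimal CPU-side sketch in C++. It is not the shader code from this branch. The names repeat_2d and dot_fma are hypothetical, and the sketch assumes a row-major 2D tensor whose destination dimensions are whole multiples of the source dimensions. The first function shows the index mapping a dedicated REPEAT kernel can use, where each destination element reads its source element directly by modulo indexing instead of being filled by a series of copies. The second shows a fused multiply-add accumulation, using std::fma as a stand-in for the GLSL fma built-in.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference for a REPEAT-style op (illustration only, not the PR's shader).
// Tiles a row-major src (src_rows x src_cols) into dst (dst_rows x dst_cols).
// Assumption: dst_rows and dst_cols are whole multiples of src_rows and src_cols.
std::vector<float> repeat_2d(const std::vector<float>& src,
                             std::size_t src_rows, std::size_t src_cols,
                             std::size_t dst_rows, std::size_t dst_cols) {
    std::vector<float> dst(dst_rows * dst_cols);
    // Each iteration plays the role of one shader invocation:
    // one destination element, one direct read from the source.
    for (std::size_t idx = 0; idx < dst.size(); ++idx) {
        const std::size_t r = idx / dst_cols;  // destination row
        const std::size_t c = idx % dst_cols;  // destination column
        dst[idx] = src[(r % src_rows) * src_cols + (c % src_cols)];
    }
    return dst;
}

// Hypothetical dot product that accumulates with a fused multiply-add.
// std::fma computes a * b + acc with a single rounding step.
float dot_fma(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc = std::fma(a[i], b[i], acc);
    }
    return acc;
}
```

Calling repeat_2d on a 2x2 source with a 4x4 destination tiles the source twice along each axis. Whether an fma call is actually faster than a separate multiply and add depends on the hardware and the driver's compiler, so treat the second function as an illustration of the operation, not a performance claim.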

I will keep this as a draft while I check a few more things, but feel free to test and benchmark. Don't expect a huge difference.

Aug 09 '24 20:08 0cc4m

I missed a validation issue in #8943, but the fix is now in this branch. I think this should be ready for a review and then merge.

Aug 11 '24 09:08 0cc4m

@ggerganov @slaren Can one of you review the non-Vulkan parts of this PR and approve if that's fine?

Aug 14 '24 14:08 0cc4m