metal: add ops DIAG_MASK_INF, IM2COL_3D, fix op PAD
I wrote this to fix a problem I was having while working with leejet/stable-diffusion.cpp. It may also fix issues that other people are having there, such as stable-diffusion.cpp #850 and #857.
As a user, I've already solved the issue for the audience that I care about. I'm offering this in hopes that it may be more helpful than merely opening an issue to complain about the missing/broken ops and going back to generating images of people with fish for heads walking down sidewalks. Yippee.
This commit is a (manual) octopus-merge of three independent commits, each of which could be split out into its own PR if this one is too broad.
test-backend-ops -b Metal -o IM2COL_3D, test-backend-ops -b Metal -o DIAG_MASK_INF, and test-backend-ops -b Metal -o PAD all passed on:
ggml_metal_device_init: GPU name: Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 19069.67 MB
See also #17175 for CONV_2D.