Halide icon indicating copy to clipboard operation
Halide copied to clipboard

[vulkan phase2] Vulkan Runtime

Open derek-gerstmann opened this issue 2 years ago • 5 comments

This set of changes is Phase2 for adding Vulkan support, which adds the runtime class interfaces for interacting with the Vulkan API. Specifically, it adds the following files:

src/CodeGen_Vulkan_Dev.cpp
src/runtime/HalideRuntimeVulkan.h
src/runtime/mini_vulkan.h
src/runtime/vulkan.cpp
src/runtime/vulkan_context.h
src/runtime/vulkan_extensions.h
src/runtime/vulkan_functions.h
src/runtime/vulkan_interface.h
src/runtime/vulkan_internal.h
src/runtime/vulkan_memory.h
src/runtime/vulkan_resources.h
src/runtime/windows_vulkan.cpp

The forthcoming Phase3 will address testing and CI modifications.

PHASE2 TODO:

  • Refactored CodeGen
  • Docs on setup, config, usage

derek-gerstmann avatar Aug 05 '22 19:08 derek-gerstmann

C:\build_bot\worker\halide-testbranch-main-llvm16-x86-64-windows-cmake\halide-source\src\CodeGen_Vulkan_Dev.cpp(1082): error C2398: Element '3': conversion from 'SpvBuiltIn_' to '_Ty' requires a narrowing conversion with [ _Ty=uint32_t ]

steven-johnson avatar Aug 08 '22 17:08 steven-johnson

C:\build_bot\worker\halide-testbranch-main-llvm16-x86-64-windows-cmake\halide-source\src\CodeGen_Vulkan_Dev.cpp(1082): error C2398: Element '3': conversion from 'SpvBuiltIn_' to '_Ty' requires a narrowing conversion with [ _Ty=uint32_t ]

Oooh, thanks! I'll fix this!

derek-gerstmann avatar Aug 08 '22 19:08 derek-gerstmann

Generally looks very nice; my comments are mostly nits but a few actionable items. My main concern here is that there are a lot of TODO comments (and also some if 0 unimplemented sections) in Codegen_Dev, and it's not clear to me which/how-many of them are "need to complete before this will work" vs "this should be improved someday but isn't blocking a functional initial release" -- clarifying those would IMHO make this easier for other folks to read and grasp the current state.

Thanks! Okay, cool! Yes, the CodeGen_Dev is in the process of being refactored to use the SPIRV_IR classes and to address most of the TODO comments. Once this is in place, I'll update the docs and status for what's working, etc!

derek-gerstmann avatar Aug 08 '22 19:08 derek-gerstmann

Monday Morning Review Ping -- where does this PR stand?

steven-johnson avatar Aug 22 '22 16:08 steven-johnson

Monday Morning Review Ping -- where does this PR stand?

Apologies … I’m on vacation but will be back later this week! Most of the requested changes are pushed … remaining items are on the codegen, and docs!

derek-gerstmann avatar Aug 22 '22 16:08 derek-gerstmann

Ready for review!

derek-gerstmann avatar Sep 30 '22 21:09 derek-gerstmann

Linux failures appear to be real, PTAL

steven-johnson avatar Oct 03 '22 18:10 steven-johnson

Linux failures appear to be real, PTAL

Strange. The arm-64-linux failure seems to come from a typo in tests/correctness/target.cpp which isn't part of my changes. I'll merge with main again.

derek-gerstmann avatar Oct 03 '22 19:10 derek-gerstmann

Ready to merge! All tests are now passing on my tested configs for Linux & Windows using the target host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13 on LLVM v14.x.

More details in the VULKAN_readme.md regarding current limitations and caveats.

derek-gerstmann avatar Dec 05 '22 22:12 derek-gerstmann

Added issues: #7202 #7203

derek-gerstmann avatar Dec 05 '22 22:12 derek-gerstmann

Question for schedule estimation: after this PR lands, do we expect this to be a functional (although perhaps alpha- or beta- quality) backend? Or are there more things we know need to land for it to be usable?

steven-johnson avatar Dec 15 '22 18:12 steven-johnson

Question for schedule estimation: after this PR lands, do we expect this to be a functional (although perhaps alpha- or beta- quality) backend? Or are there more things we know need to land for it to be usable?

I'd consider this beta-quality as it is ... it's passing all tests on the platforms and devices I have access to, and it's more feature complete than the OpenGL compute backend. However, it needs more testing on other platforms/devices, and performance tuning before I'd call it production ready.

derek-gerstmann avatar Dec 15 '22 21:12 derek-gerstmann

You will need to sync this PR to main in order to build with top-of-tree LLVM -- ordinarily I'd just do that instead of suggesting it but there are enough conflicts that I suspect it would be better handled by you

steven-johnson avatar Feb 09 '23 21:02 steven-johnson

(I took the liberty of syncing this to main, since the conflicts in test/generator were a little subtle)

steven-johnson avatar Feb 13 '23 18:02 steven-johnson

(I took the liberty of syncing this to main, since the conflicts in test/generator were a little subtle)

Ahh, you beat me to it! Yea, I noticed the generator conflicts were a bit more involved. Thank you!

derek-gerstmann avatar Feb 14 '23 15:02 derek-gerstmann

There's a subtle but critical mistake in a lot of the runtime code (which is unfortunately easy to make):

When an error occurs in any Halide runtime code, you must do BOTH of these things:

  • call halide_error(), either directly, or indirectly (typically via the error() stream output)
  • AND, return a nonzero error code from the original entry point of the runtime call.

It's easy to leave one of these out, but failing to return a nonzero error code means that anything that replaces the error handler with something that doesn't abort won't see the errors (and may crash).

The specific case that caused me trouble was vulkan_memory.h:746, but from a quick glance, it looks like failing to return a nonzero error code is fairly pervasive in this PR (again, it's an easy mistake to make).

I'm surprised there doesn't seem to be any open Issue on this, so I'm opening one: https://github.com/halide/Halide/issues/7350

steven-johnson avatar Feb 14 '23 20:02 steven-johnson

When an error occurs in any Halide runtime code, you must do BOTH of these things: call halide_error(), either directly, or indirectly (typically via the error() stream output) AND, return a nonzero error code from the original entry point of the runtime call.

Ooooh! Thanks for clarifying … I definitely did not know that! I’ll make another pass and clean up the return codes so these propagate!

derek-gerstmann avatar Feb 14 '23 21:02 derek-gerstmann

@steven-johnson Please take a look at the error handling fixes when you get a chance, and let me know if I missed anything!

derek-gerstmann avatar Mar 17 '23 22:03 derek-gerstmann

@steven-johnson Please take a look at the error handling fixes when you get a chance, and let me know if I missed anything!

I am on vacation next week -- I probably won't be able to review properly until March 27.

(In the meantime, might make sense to merge this with main and resolve the conflicts)

steven-johnson avatar Mar 17 '23 23:03 steven-johnson

@steven-johnson Please take a look at the error handling fixes when you get a chance, and let me know if I missed anything!

I am on vacation next week -- I probably won't be able to review properly until March 27.

(In the meantime, might make sense to merge this with main and resolve the conflicts)

Okay, no worries! Okay will do … I’ll merge this with main and resolve the conflicts

derek-gerstmann avatar Mar 17 '23 23:03 derek-gerstmann

Apologies if this is not the appropriate place to mention this. I've been keeping an eye out for this PR since I'm interested in migrating some code to use Vulkan on Android. Unfortunately, I can see that the following extensions are being used:

            if (arg.type.bits() == 8) {
                builder.require_extension("SPV_KHR_8bit_storage");
            } else if (arg.type.bits() == 16) {
                builder.require_extension("SPV_KHR_16bit_storage");
            }

In my limited testing they are not widely available on Android devices. The only safe option has been to use uint32_t and read 2 (or 4) pixels at a time.

mirsadm avatar Apr 04 '23 09:04 mirsadm

Apologies if this is not the appropriate place to mention this. I've been keeping an eye out for this PR since I'm interested in migrating some code to use Vulkan on Android. Unfortunately, I can see that the following extensions are being used:

            if (arg.type.bits() == 8) {
                builder.require_extension("SPV_KHR_8bit_storage");
            } else if (arg.type.bits() == 16) {
                builder.require_extension("SPV_KHR_16bit_storage");
            }

In my limited testing they are not widely available on Android devices. The only safe option has been to use uint32_t and read 2 (or 4) pixels at a time.

Hi! Thanks very much for the feedback! The current implementation requires that the host target flag being used when compiling turns on the features for vk_uint8 and vk_uint16 to enable this codegen path, and then the runtime validates that the required extensions are available when attempting to execute the code. So, if you know that you're destined to run on a machine config that only has 32bit type support (which is the default assumption for vulkan), you shouldn't try and use 8bit or 16bit types. I agree we could do more work to try and pack smaller datatypes into larger ones for configurations that don't support 8-bit or 16-bit storage, but this would impose a performance hit to do the implicit packing / unpacking with masking and shifts. Is it acceptable for your use case to write your Halide algorithm such that it only uses 32-bit types for storage?

derek-gerstmann avatar Apr 04 '23 19:04 derek-gerstmann

Hi! Thanks very much for the feedback! The current implementation requires that the host target flag being used when compiling turns on the features for vk_uint8 and vk_uint16 to enable this codegen path, and then the runtime validates that the required extensions are available when attempting to execute the code. So, if you know that you're destined to run on a machine config that only has 32bit type support (which is the default assumption for vulkan), you shouldn't try and use 8bit or 16bit types. I agree we could do more work to try and pack smaller datatypes into larger ones for configurations that don't support 8-bit or 16-bit storage, but this would impose a performance hit to do the implicit packing / unpacking with masking and shifts. Is it acceptable for your use case to write your Halide algorithm such that it only uses 32-bit types for storage?

Hi! Thanks for the reply. Yes it is of course possible to use 32-bit types but in my brief testing the performance penalty due to the extra memory bandwidth far outweighs the extra compute to unpack the pixels. I'll give it test once it is merged to master :)

mirsadm avatar Apr 04 '23 19:04 mirsadm

in my brief testing the performance penalty due to the extra memory bandwidth far outweighs the extra compute to unpack the pixels

Understandable, but... if a device doesn't have the uint8/uint16 extensions, then what other alternative is there?

steven-johnson avatar Apr 04 '23 22:04 steven-johnson

in my brief testing the performance penalty due to the extra memory bandwidth far outweighs the extra compute to unpack the pixels

Understandable, but... if a device doesn't have the uint8/uint16 extensions, then what other alternative is there?

For what it’s worth, this Android developers guide seems to indicate that most shipping devices should have adequate support:

https://source.android.com/docs/core/graphics/implement-vulkan#versions

Note that v1.2 supports 8bit-storage

derek-gerstmann avatar Apr 04 '23 22:04 derek-gerstmann

Understandable, but... if a device doesn't have the uint8/uint16 extensions, then what other alternative is there?

With the devices I have at hand (Snapdragon 855/865), they support the feature shaderInt16 which allows them to use 16 bit integers in shader code. They are not able to use 16 bit buffers for input/output. Additionally, they support the extension VK_KHR_shader_float16_int8 which allows me to read 32 bit ints into 2 float16 and get better throughput. Vulkan and Android is a bit of minefield but this has been the only way I have managed to get real time performance on Android with broader support.

mirsadm avatar Apr 05 '23 08:04 mirsadm

Understandable, but... if a device doesn't have the uint8/uint16 extensions, then what other alternative is there?

With the devices I have at hand (Snapdragon 855/865), they support the feature shaderInt16 which allows them to use 16 bit integers in shader code. They are not able to use 16 bit buffers for input/output. Additionally, they support the extension VK_KHR_shader_float16_int8 which allows me to read 32 bit ints into 2 float16 and get better throughput.

Hmmm. I'm not sure how this might be added as a feature ... we'd need to know that you intend to pass a buffer of 32bit integers with the intent of indexing it as two float16s? And we'd have to know at CodeGen time, which would imply that you'd need some set of feature flags to trigger this. Feel free to add additional suggestions and use cases to the ticket I created: https://github.com/halide/Halide/issues/7478

derek-gerstmann avatar Apr 05 '23 22:04 derek-gerstmann

@steven-johnson I don't have write access to do the merge ... could you approve this PR when you get a chance?

derek-gerstmann avatar Apr 24 '23 20:04 derek-gerstmann

@steven-johnson I don't have write access to do the merge ... could you approve this PR when you get a chance?

Sure, but did the other reviewers approve it? It still says "needs 1 approving review"

steven-johnson avatar Apr 24 '23 21:04 steven-johnson

@steven-johnson I don't have write access to do the merge ... could you approve this PR when you get a chance?

Sure, but did the other reviewers approve it? It still says "needs 1 approving review"

Nope. Nobody that has made comments has hit the approval button, even though all comments were addressed so far.

derek-gerstmann avatar Apr 24 '23 21:04 derek-gerstmann