DLSS Support Outline
What would it realistically take to implement support for DLSS within this project? Could we make use of DXVK-NVAPI for such a feature?
This is a wonderful project and I’m looking forward to contributing some of my own work to it.
I am very curious about this question as well, especially since the README contains the following text:
> Realistically, it's now abandoned and will only possibly receive updates to run workloads I am personally interested in (DLSS).
Being able to run DLSS on AMD hardware would be a game-changer for sure, and I am curious what it will take to get there. Amazing project, and I am hoping this won't die now that funding has stopped.
Attempt no.1 (Base ZLUDA build)
Details: DXVK-NVAPI requires patches in order to bypass its AMD GPU detection; the rest of the libraries work fine. The GPU architecture is set to NV_GPU_ARCHITECTURE_TU100.

Observation: the games fail at NVSDK_NGX_D3D12_GetFeatureRequirements, although preliminary initialization does succeed. Two examples are provided (Minecraft RTX, R&C Rift Apart).
I did not take notes last time I poked at this, so it might not be 100% accurate, but:

- Don't use a full game; rather, start with the DLSS 2 sample app (https://github.com/NVIDIA/DLSS/releases)
- I use D3D11 API names below, but you should try Vulkan on Linux
- DLSS calls can be intercepted at the nvapi level or at the CUDA dark api level. I did both and I recommend going down the nvapi path. The CUDA dark api functions for DLSS are partially mapped out in ZLUDA, but it's a pain; nvapi has all the nice names and it's clear what you are using
- Your first goal is to dump the DLSS kernels: just break on `NvAPI_D3D11_CreateCubinComputeShaderWithName` and dump the `pCubin` argument. It's a nice PTX assembly, and that's how far I got
- The PTX you get looks ostensibly normal, but it's dodgy like this:
  ```
  cvt.u64.u32 %rd58, %r86;
  ...
  sust.p.2d.v4.b32.zero [%rd58, {%r195,%r196}], {%f576,%f1418,%f1418,%f1418};
  ```
- Now, my goal was to have nvapi-without-nvapi, meaning implementing nvapi functions in terms of CUDA functions, so `NvAPI_D3D11_CreateCubinComputeShaderWithName` -> `cuModuleLoadData` + `cuModuleGetFunction`, and `NvAPI_D3D11_LaunchCubinShader` -> `cuLaunchKernel` (see the sketch below). Might require touching up the PTX manually or doing some pre/post-processing on the host side. This is all to prove that we fully understand the DLSS API flow and that there's no action at a distance. Might not be possible
- The next step would be to compile the PTX on ZLUDA. There are some missing instructions: `tex.` with LOD, `mma.` or `wmma.`. At this point they don't have to be fast, just correct
- Now you need to figure out how to interop with Vulkan. What you need is an API where you can post an arbitrary ZLUDA kernel to a Vulkan command list. This probably does not exist, but since you are on Linux and have the sources to the Vulkan implementation, you can create a temporary implementation. The problem is with emitting the ZLUDA kernel: it must ABI-match whatever your Vulkan implementation is expecting. I've seen the `amdgpu_cs` calling convention in LLVM, maybe that's enough?

`zluda_api` is the ZLUDA subproject with enough nvapi implemented for the purpose of debugging and development of DLSS.
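To make that mapping concrete, here is a minimal sketch of the nvapi-without-nvapi idea, assuming hand-written FFI bindings to the CUDA driver API. The `create_cubin_shader` and `launch_cubin_shader` wrappers are hypothetical illustrations, not code that exists in ZLUDA:

```rust
use std::ffi::c_void;
use std::os::raw::{c_char, c_int, c_uint};

// Hand-written CUDA driver API declarations; under ZLUDA these entry
// points would be served by ZLUDA's own CUDA implementation.
#[repr(C)] pub struct CUmod_st { _unused: [u8; 0] }
#[repr(C)] pub struct CUfunc_st { _unused: [u8; 0] }
pub type CUmodule = *mut CUmod_st;
pub type CUfunction = *mut CUfunc_st;
pub type CUresult = c_int; // 0 == CUDA_SUCCESS

extern "C" {
    fn cuModuleLoadData(module: *mut CUmodule, image: *const c_void) -> CUresult;
    fn cuModuleGetFunction(
        func: *mut CUfunction,
        module: CUmodule,
        name: *const c_char,
    ) -> CUresult;
    fn cuLaunchKernel(
        f: CUfunction,
        grid_dim_x: c_uint, grid_dim_y: c_uint, grid_dim_z: c_uint,
        block_dim_x: c_uint, block_dim_y: c_uint, block_dim_z: c_uint,
        shared_mem_bytes: c_uint,
        stream: *mut c_void, // CUstream, left opaque here
        kernel_params: *mut *mut c_void,
        extra: *mut *mut c_void,
    ) -> CUresult;
}

/// Hypothetical stand-in for NvAPI_D3D11_CreateCubinComputeShaderWithName:
/// the pCubin blob (which is really PTX) is loaded as an ordinary CUDA
/// module, and the named entry point becomes an ordinary CUDA function.
pub unsafe fn create_cubin_shader(
    p_cubin: *const c_void,
    name: *const c_char,
) -> Result<CUfunction, CUresult> {
    let mut module: CUmodule = std::ptr::null_mut();
    match cuModuleLoadData(&mut module, p_cubin) {
        0 => {}
        err => return Err(err),
    }
    let mut func: CUfunction = std::ptr::null_mut();
    match cuModuleGetFunction(&mut func, module, name) {
        0 => Ok(func),
        err => Err(err),
    }
}

/// Hypothetical stand-in for NvAPI_D3D11_LaunchCubinShader; the real nvapi
/// call also carries block dimensions, which are hardcoded to 1 here.
pub unsafe fn launch_cubin_shader(
    f: CUfunction,
    grid: (u32, u32, u32),
    params: *mut *mut c_void,
) -> CUresult {
    cuLaunchKernel(
        f,
        grid.0, grid.1, grid.2,
        1, 1, 1,              // block dims
        0,                    // dynamic shared memory
        std::ptr::null_mut(), // default stream
        params,
        std::ptr::null_mut(),
    )
}
```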
Regarding the Vulkan interop, it seems HIP should support it, although I'm not sure whether a kernel can be bound to an arbitrary command list as you describe. See the example here.

D3D12 interop should also be available on Windows. There are no HIP samples, but the recently released Orochi 2.0 does contain a DX12 sample: here
I ran some tests and want to report my findings, which are very bare-bones.

I tried to run the DLSS sample linked above on an AMD GPU on Linux. The first issue is that the sample requires a few Vulkan extensions that AMD GPUs do not expose:

- `VK_NVX_binary_import` - allows applications to import CuBIN binaries and execute them
- `VK_NVX_image_view_handle` - allows applications to query an opaque handle from an image view, for use as a sampled image or storage image
I patched the source to bypass those checks, along with some other changes:

- patching `NGXWrapper::IsFeatureSupported` to always return true
- setting `deviceParams.swapChainBufferCount = 3;`
- patching `surfaceCaps.maxImageCount < m_DeviceParams.swapChainBufferCount` to `surfaceCaps.maxImageCount != 0 && surfaceCaps.maxImageCount < m_DeviceParams.swapChainBufferCount`
- disabling `.setBufferDeviceAddressMultiDevice(true);`
- changing the shadow map and depth buffer formats from `nvrhi::Format::D24S8` to `nvrhi::Format::D32S8` (fixes broken rendering)
After this I finally got the sample running; sadly, I did not see any calls to NVAPI. This is likely because NGX does a feature check and refuses to initialize on unsupported GPUs. At this point I have some questions:
- Is it possible to spoof the GPU and the supported Vulkan extensions? (I tried `MESA_EXTENSION_OVERRIDE` but had no luck; I assume it's OpenGL-only)
- Has anyone managed to intercept DLSS-related NVAPI calls on an AMD GPU?
- What do you think the best strategy going forward would be?
- Is the situation simpler with DirectX?
I think there are three options:
- Find out how to patch the necessary checks in NGX
- Find a way to spoof vulkan extensions and GPU
- Modify radv/mesa to report those extensions as available
Update:
Compiling mesa from source was surprisingly easy. I updated radv to report that it supports the two extensions mentioned above. I also figured out that on Linux the architecture check is done using libnvidia-ml instead of NVAPI, so I updated the ZLUDA implementation of that library to report my card as Turing.
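The change is conceptually tiny. A minimal sketch of the idea, assuming an exported NVML-style entry point (this mirrors the real nvmlDeviceGetCudaComputeCapability signature, but it is not the actual zluda_ml code):

```rust
use std::os::raw::c_int;

#[repr(C)]
pub struct NvmlDevice {
    _unused: [u8; 0],
}
pub type NvmlReturn = c_int; // 0 == NVML_SUCCESS

/// Report compute capability 7.5 (Turing) no matter what the underlying
/// AMD hardware actually is, so that NGX's architecture check passes.
#[no_mangle]
pub unsafe extern "C" fn nvmlDeviceGetCudaComputeCapability(
    _device: *mut NvmlDevice,
    major: *mut c_int,
    minor: *mut c_int,
) -> NvmlReturn {
    if major.is_null() || minor.is_null() {
        return 2; // NVML_ERROR_INVALID_ARGUMENT
    }
    *major = 7;
    *minor = 5;
    0 // NVML_SUCCESS
}
```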
This got me further, but DLSS support is still not being reported. I suspect NVSDK_NGX_VULKAN_GetFeatureRequirements is what is blocking things. I'll try to find a way to enable more logging in NGX; maybe it will shed some light on what's going wrong.
Update 2:
After running the demo with logging enabled via the `__NGX_LOG_LEVEL=10` environment variable, I was able to pinpoint why the application is failing.
These are the relevant snippets:
```
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCreateCuModuleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCreateCuFunctionNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkDestroyCuModuleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkDestroyCuFunctionNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCmdCuLaunchKernelNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkGetImageViewHandleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkGetImageViewAddressNVX
[tid:124644461002368][CreateKernel:388] Error: vkCreateCuModuleNVX() and/or vkCreateCuFunctionNVX() functions are not available. Request VK_NVX_binary_import extension
```
This brings us to the point where we have to implement these functions in radv. Before doing that, I thought it would be wise to test the Vulkan/HIP interop example I linked above. Sadly, after compiling it, the program crashes with the following error:

```
hipVulkan: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.2/rocclr/device/rocm/rocdevice.hpp:241: virtual bool roc::NullDevice::importExtSemaphore(void**, const amd::Os::FileDesc&, amd::ExternalSemaphoreHandleType): Assertion `false && "ShouldNotReachHere()"' failed.
```
It appears hipImportExternalSemaphore and a few other functions are not implemented in the ROCm CLR backend, only in the PAL one, which I was unable to easily compile, so that will be for another day. I opened an issue about it: https://github.com/ROCm/clr/issues/72

It is likely that the HIP/Vulkan interop functions require a special build of rocm-clr and the AMDVLK driver instead of RADV. I guess the next step will be getting that working, followed by compiling a custom version of AMDVLK that supports the required NVX extensions.

Windows might also be a better platform to target, as I expect the interop will work there. But I have no idea if there is a way to inject extensions into the Vulkan/DX12 implementation.
I managed to get a Vulkan layer 'working'; at least I can get calls passed from the DLSS example to ZLUDA, and I now have the PTX dumped.
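For anyone following along, these are the VK_NVX_binary_import structures the layer has to handle. The Rust declarations below are my own transcription from the Vulkan spec (handle types abbreviated), and the dump hook is just an illustration of the layer idea, not the actual layer code:

```rust
use std::ffi::c_void;
use std::os::raw::c_char;

// Abbreviated Vulkan handle/scalar types, enough for illustration.
pub type VkDevice = *mut c_void;
pub type VkCuModuleNVX = u64;   // non-dispatchable handle
pub type VkCuFunctionNVX = u64; // non-dispatchable handle
pub type VkResult = i32;        // 0 == VK_SUCCESS
pub type VkStructureType = i32;

// vkCreateCuModuleNVX consumes this; pData is the kernel blob that NGX
// submits, and on NVIDIA it turns out to be PTX rather than SASS.
#[repr(C)]
pub struct VkCuModuleCreateInfoNVX {
    pub s_type: VkStructureType,
    pub p_next: *const c_void,
    pub data_size: usize,
    pub p_data: *const c_void,
}

// vkCreateCuFunctionNVX: looks up a named entry point in a module.
#[repr(C)]
pub struct VkCuFunctionCreateInfoNVX {
    pub s_type: VkStructureType,
    pub p_next: *const c_void,
    pub module: VkCuModuleNVX,
    pub p_name: *const c_char,
}

// vkCmdCuLaunchKernelNVX: essentially cuLaunchKernel recorded into a
// Vulkan command buffer.
#[repr(C)]
pub struct VkCuLaunchInfoNVX {
    pub s_type: VkStructureType,
    pub p_next: *const c_void,
    pub function: VkCuFunctionNVX,
    pub grid_dim_x: u32,
    pub grid_dim_y: u32,
    pub grid_dim_z: u32,
    pub block_dim_x: u32,
    pub block_dim_y: u32,
    pub block_dim_z: u32,
    pub shared_mem_bytes: u32,
    pub param_count: usize,
    pub p_params: *const *const c_void,
    pub extra_count: usize,
    pub p_extras: *const *const c_void,
}

/// Illustrative layer-side hook: dump the PTX blob before handing the
/// call on. A real layer would also create a module object and forward
/// the launch to ZLUDA.
pub unsafe extern "C" fn hook_create_cu_module_nvx(
    _device: VkDevice,
    create_info: *const VkCuModuleCreateInfoNVX,
    _allocator: *const c_void,
    module: *mut VkCuModuleNVX,
) -> VkResult {
    let blob = std::slice::from_raw_parts(
        (*create_info).p_data as *const u8,
        (*create_info).data_size,
    );
    std::fs::write("dlss_kernel.ptx", blob).ok();
    *module = 1; // fake handle, purely for illustration
    0 // VK_SUCCESS
}
```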
I had to implement/stub the following instructions: `tex.base`, `sust.p`

I am now missing the following: `cvt.rn.f16x2.f32`, `ldmatrix`, `mma`, `fma.relu`, `cp.async.cg`
@vosen My question now is about cvt with f16x2. What do you think is the best way to implement it, given that it has three arguments instead of the usual two for cvt?

Is it better to create a new struct like this:

```rust
pub struct Arg3Cvt<P: ArgParams> {
    pub dst: P::Operand,
    pub src1: P::Operand,
    pub src2: Option<P::Operand>,
}
```

and substitute it as the argument type for all versions of cvt? Or is it better to make a new cvtHalf instruction?

Or is there a third, better way? I'm still learning Rust.
Best to decide depending on what sort of LLVM bitcode we need to emit for it. ast::CvtDetails already splits the conversion cases into separate variants (IntFromInt, FloatFromFloat, IntFromFloat, FloatFromInt) which are handled differently; one more distinct variant is going to fit in fine with the rest of the code. On the other hand, adding a distinct ast::Instruction variant just for cvt.rn.f16x2.f32 is fine too: there is no reason for the other cvt variants to care about a src2 that will always be None, and there's already precedent for this sort of thing in the codebase (e.g. `Atom` vs `AtomCas`).

Both solutions are good; choose whichever turns out to be more convenient.
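For what it's worth, a compilable sketch of the enum-variant route, together with the scalar semantics of the packing per my reading of the PTX ISA. The variant names are illustrative, not the actual ZLUDA AST, and the helper uses the `half` crate for round-to-nearest-even f32-to-f16 conversion:

```rust
// Illustrative stand-in for ast::CvtDetails; the real variants carry
// their own conversion descriptions (types, rounding, saturation).
pub enum CvtDetails {
    IntFromInt,
    FloatFromFloat,
    IntFromFloat,
    FloatFromInt,
    // New variant for cvt.rn.f16x2.f32, the only cvt with two sources.
    PackedHalfFromFloat,
}

/// Scalar semantics of `cvt.rn.f16x2.f32 d, a, b`, per my reading of the
/// PTX ISA: round both f32 inputs to f16 (round-to-nearest-even) and pack
/// them, with `a` in the upper 16 bits of `d` and `b` in the lower 16 bits.
pub fn cvt_rn_f16x2_f32(a: f32, b: f32) -> u32 {
    ((half::f16::from_f32(a).to_bits() as u32) << 16)
        | (half::f16::from_f32(b).to_bits() as u32)
}
```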
So after the rollback, is it still possible to work on DLSS, or is that now a distant goal?
> So after the rollback, is it still possible to work on DLSS, or is that now a distant goal?
It’s pretty much a pipe dream for the foreseeable future.