ktxTexture_GLUpload fails to find GL symbols with NVIDIA driver on Linux
This is a followup for https://github.com/KhronosGroup/KTX-Software/discussions/707
I've been seeing "Could not load OpenGL command: glBindTexture!" when calling ktxTexture_GLUpload, although by using GLAD I can call glBindTexture just fine.
I reproduced the issue here: https://github.com/viseztrance/ktx-loader-issue
I also compiled the code on ubuntu 20.04 in a virtual machine and it worked there, and then tried the binary on my machine where it failed complaining about glBindTexture.
Then I tried the binary on several other machines (full breakdown below), and I think it might be an issue related to the nvidia drivers (ver. 530.41.03)?
| Operating System | Hardware | Status |
|---|---|---|
| Ubuntu 20.04 | virtual machine | OK |
| Fedora 38 | nvidia rtx 3060 | FAIL |
| Fedora 38 | nvidia gtx 1060 (3GB) | FAIL |
| Fedora 38 | intel | OK |
| Fedora 38 | intel | OK |
| Fedora 33 | intel | OK |
Some other things I tried:
- removing glad had no effect.
- compiling KTX dynamically also had no effect
I have figured this out and it happens because I wasn't passing: -lGL to the compile flags.
Though to be clear, without passing it, opengl was working fine. It's just that the function lookup fails on nvidia video cards.
So the following printed that the function was defined on intel, but not on nvidia. Passing -lGL would work on both.
#include <dlfcn.h>
#include <cstdio>
#include "GLFW/glfw3.h"

int main() {
    glfwInit();
    GLFWwindow* window = glfwCreateWindow(800, 600, "Test", nullptr, nullptr);
    if (window == nullptr) {
        printf("Could not create window!\n");
        glfwTerminate();
        return -1;
    }
    // A null filename gives a handle to the main program and its global symbol scope.
    void* handle = dlopen(nullptr, RTLD_LAZY);
    // Check whether an OpenGL entry point is visible from that scope.
    auto f = dlsym(handle, "glBindTexture");
    if (!f) {
        printf("No func defined!\n");
    } else {
        printf("Func WAS defined!\n");
    }
    glfwTerminate();
    return 0;
}
It turns out that adding -lGL only fixed the issue when compiling on Fedora. Compiling on standard ubuntu 20.04 (libgl-dev, libgl1-mesa-dev installed) won't actually link it, so running will trigger the same error.
ldd shows the following (which is consistent with other programs such as unity3d games):
linux-vdso.so.1 (0x00007ffd83fef000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f9ed3857000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9ed3852000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f9ed2800000)
libm.so.6 => /lib64/libm.so.6 (0x00007f9ed3771000)
libc.so.6 => /lib64/libc.so.6 (0x00007f9ed261e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9ed3883000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9ed374d000)
~~This change does fix things for me, but it would be nice for it to be solved upstream (with proper error handling and such)~~. While the following works, it doesn't reuse the shared gl context...
void *libGL = dlopen("libGL.so", RTLD_NOW | RTLD_GLOBAL);
#define GL_FUNCTION(type, func, required) \
gl.func = (type)dlsym(libGL, #func); \
The issue is that dlsym is not finding the OpenGL symbols. It is nothing specifically to do with glBindTexture.
The tests we use all run on Ubuntu 22.04 and ran on Ubuntu 20.04 when we were using that. We need to figure out what is different in your case. Some questions:
- Does the problem still only appear with NVIDIA drivers?
- Is your application loading the OpenGL shared library, creating a GL context and making that context current before calling ktxTexture*_GLUpload?
- When it calls dlopen to load OpenGL does it set RTLD_GLOBAL?
The only graphics devices/drivers I have available are an Apple M2 chip and an x86_64 with Intel integrated GPU so if it is NVIDIA-related all I can do is guess at the problem.
Some things to try:
- Test with a libktx.so instead of a static library and let me know the result.
- Run ldd after your app has called glfwInit and see if OpenGL is still missing from the list.
We cannot use the fix you are using because the GLUpload code is for both OpenGL and OpenGL ES.
I created a repo with a minimal example a while back https://github.com/viseztrance/ktx-loader-issue (the dangling reference compilation flag used there was due to an fmt compilation error I received back then, related to a g++ bug).
- Yes, the problem manifests only on nvidia, intel works fine.
- The gl context is created before invoking any ktx function
- dlopen used RTLD_LOCAL. However, I tried changing RTLD_LOCAL to RTLD_GLOBAL in both glad and glfw, and it doesn't fix the issue.
Switching to clang++ also fails. Compiling dynamically also doesn't work (also made sure to remove KHRONOS_STATIC from the code while doing so).
As I said, -lGL fixes the issue, but not always (ex. ubuntu 20.04).
If I comment out #define glBindTexture gl.glBindTexture and all the other lines from gl_funcs.h, -lGL will link properly, so it will work even on ubuntu 20.04.
I'm sure there's a nice way of forcing linking to libgl, but I gave this some thought and I think that you shouldn't need to use -lGL in the first place, because it also adds extra things like libX11 as dependencies. If you want to support both wayland and xorg, it doesn't feel right for someone running wayland to have to install xorg libs.
Strangely dlopen("libGL.so", RTLD_NOW | RTLD_GLOBAL); instead of using null actually breaks on 2 of the 3 intel machines I tried, so that's definitely not a good solution for desktop either.
@viseztrance thank you for your reply and thank you for trying with libktx.so.
You are right that you shouldn't need to use -lGL. I happen to do so though in my test apps when building for Ubuntu along with #define GL_GLEXT_PROTOTYPES.
Commenting out the lines in gl_funcs.h is not viable. It would force every user of static libktx to link with OpenGL and require OpenGL be installed anywhere the shared library might be installed.
The short test app you gave in your comment of May 30th shows this is not a libktx problem per se. It shows the same problem. It demonstrates that the main program does not apparently contain the OpenGL symbols. Why not?
Please make the test app pause before exiting, then run lsof -p <pid> in a console, where <pid> is the app's process id, to see what shared libraries it has loaded. Does it have OpenGL loaded? Try it on NVIDIA and other drivers.
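As a rough in-process complement to the lsof check above, the test app could also print its own list of loaded objects. The sketch below assumes Linux with glibc, where dl_iterate_phdr is available; the helper names loadedObjects and isLoaded are made up for illustration and are not part of libktx, glad or glfw.

```cpp
#include <link.h>   // dl_iterate_phdr, dl_phdr_info (glibc extension)
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Callback invoked once per object mapped into the process; records its path.
static int collectObject(dl_phdr_info* info, std::size_t /*size*/, void* data) {
    assert(data != nullptr);
    auto* names = static_cast<std::vector<std::string>*>(data);
    // glibc reports the main program with an empty name.
    const bool hasName = info->dlpi_name != nullptr && info->dlpi_name[0] != '\0';
    names->emplace_back(hasName ? info->dlpi_name : "(main program)");
    return 0;  // returning non-zero would stop the iteration
}

// Hypothetical helper: paths of all objects currently loaded in this process.
std::vector<std::string> loadedObjects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collectObject, &names);
    return names;
}

// Hypothetical helper: true if any loaded object's path contains `needle`.
bool isLoaded(const std::string& needle) {
    for (const auto& name : loadedObjects()) {
        if (name.find(needle) != std::string::npos) return true;
    }
    return false;
}
```

Calling isLoaded("libGL") or isLoaded("libGLX") right after context creation would show whether the library is mapped at all. It does not show whether its symbols are in the global scope, which is the part dlsym on the main program depends on.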
I've just spent about 15 minutes trying to find where glfw loads OpenGL but failed. As I think you have more familiarity with it I'll ask you some questions:
1. When glfw loads OpenGL what does its call to dlopen() look like?
2. Is no. 1 the same for both Wayland and X11?
3. Does it use RTLD_LAZY?
4. Does it have NVIDIA specific code anywhere?
5. Does it use dlmopen() not dlopen()?
6. How does it look up the symbols?
7. Does it dlclose() the library after initializing its function pointers or at any point before termination?
None of these, except no. 4, would explain why it only fails on NVIDIA drivers but might give a hint.
Apologies, I think I misled you with my last comment. Both glad and glfw are using dlopen with RTLD_LOCAL to load libraries. However, only glad was loading opengl, but I changed them for both to make sure.
In other words, I don't think glfw creates any opengl context, it only provides the window to associate it with (at least that's my understanding of it).
This is the output of lsof -p. nvidia-no-lgl.txt is the one failing.
Proprietary nvidia drivers: nvidia-lgl.txt nvidia-no-lgl.txt
Nouveau drivers (same machine): noveau-lgl.txt noveau-no-lgl.txt
Intel drivers: intel-lgl.txt intel-no-lgl.txt
Even though this is failing, opengl is loaded in my main context and it's working fine, which is why it is strange that RTLD_GLOBAL didn't fix it.
I'll try to pass the loaded opengl functions to the ktx lib somehow to solve it for my use case for now.
Thank you. Neither glad nor glfw appear in the lists so it looks like you are linking them statically. That means RTLD_GLOBAL has no effect, if it is one of those that is loading OpenGL, because the object loading OpenGL is the main program object and libktx is part of that object.
I need to know which of glfw or glad or possibly both is loading OpenGL and I need to know how they are retrieving the OpenGL symbols. Also does the problem occur in both Debug and Release configs?
I tried to follow where opengl is being loaded, and it's not in glad (it never calls dlopen regardless if I compiled with -lGL or not, among other things). Where in glfw is difficult to tell, because I'm guessing it's related to its x11_window.c ...
The problem occurs in both Debug and Release - I tried compiling with -DCMAKE_BUILD_TYPE=Release and -DCMAKE_BUILD_TYPE=Debug, and main program with / without -g.
These lines in gladLoaderLoadGL are probably the reason:
if (did_load) {
gladLoaderUnloadGL();
}
Why the problem appears only on NVIDIA I don't know. Maybe the other drivers dlopen OpenGL.
did_load is set to non-zero if the library was not loaded at the start of the function. The only function that sets the global variable indicating the library is loaded is glad_gl_dlopen_handle which appears to be an internal function. The only function that calls it is gladLoaderLoadGL so I don't see how the library can ever be loaded when that function is called. If you can call glad_gl_dlopen_handle from your application I think that will fix the problem.
Thank you for looking further into this.
I may be wrong, but I think that gladLoaderLoadGL is to be used if no opengl context is loaded, and gladLoadGL is meant to be used with an existing one (like glfw). I was using the latter function.
Now, I have tried them both, including calling glad_gl_dlopen_handle but it didn't work.
gl3w (which is simpler than glad) also exhibits the same issue.
GLFW only loads libGLX.so then uses dlsym on that to retrieve the glXGetProcAddress which it uses to retrieve the OpenGL function pointers. I suspect that NVIDIA's libGLX implementation, libGLX_nvidia.so, is loading whatever library has the OpenGL symbols with RTLD_LOCAL so they are not visible outside the .so. This is legit. The OpenGL and GLX specs say apps should use glXGetProcAddress to retrieve OpenGL function pointers.
Originally using libktx's GLUpload required apps to link with OpenGL and it did not dynamically load function pointers. When I removed the link requirement and added dynamic loading I had to do so in a way that would not break existing apps. Given the wide range of official ways of querying OpenGL function pointers across platforms and the wide range of function-pointer wrangling libraries, it was and is not reasonable for GLUpload to second guess the app and attempt to figure out the mechanism the app is using. So I settled on searching the application module via dlsym as the app has to have the OpenGL context set up before calling GLUpload. Until your report it seemed to have been working fine. I'm surprised it has only happened now as libktx has been this way for a number of years and NVIDIA is a very popular OpenGL provider.
For now please try void *libGL = dlopen("libGL.so", RTLD_NOW | RTLD_GLOBAL); in your app before calling GLUpload the first time. If this works, RTLD_LAZY should work too and will be faster.
The only fix I can think of for libktx is to provide a loadGL function you can call to which you provide the GetProcAddress function to use just as you are doing now with Glad. If not called, the library will use dlsym on the app module as it does now. What do you think?
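To make the proposal above concrete, here is a minimal sketch of how such a hook could behave. It is not libktx code: exampleLoadGL, resolveGLSymbol, GLProc and ProcAddressFn are hypothetical names chosen for illustration, and the loader signature is modelled on the GetProcAddress-style callbacks that glad and gl3w accept.

```cpp
#include <dlfcn.h>
#include <cassert>

// Generic GL entry point type and GetProcAddress-style loader (hypothetical aliases).
using GLProc = void (*)();
using ProcAddressFn = GLProc (*)(const char* name);

// Loader supplied by the application; null means none was registered.
static ProcAddressFn g_appLoader = nullptr;

// Hypothetical registration call: the app passes the same function it gives to glad.
void exampleLoadGL(ProcAddressFn loader) { g_appLoader = loader; }

// Existing behaviour described in the thread: search the application module via dlsym.
static GLProc lookupInAppModule(const char* name) {
    static void* self = dlopen(nullptr, RTLD_LAZY);  // handle to the main program
    if (self == nullptr) return nullptr;
    // Converting void* to a function pointer is the usual POSIX dlsym idiom.
    return reinterpret_cast<GLProc>(dlsym(self, name));
}

// Resolve one GL function: prefer the app's loader, otherwise fall back to dlsym.
GLProc resolveGLSymbol(const char* name) {
    assert(name != nullptr);
    return g_appLoader ? g_appLoader(name) : lookupInAppModule(name);
}
```

With GLFW, an app would presumably call exampleLoadGL(glfwGetProcAddress) once the context is current. Apps that never call it keep the current dlsym behaviour, so existing users are not affected.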
BTW your minimal example works (after I fixed some compile issues) under Ubuntu 22.04 under Parallels Desktop on my M2 MacBook.
I can confirm that calling dlopen("libGL.so", RTLD_LAZY | RTLD_GLOBAL); in my main code fixes the issue!
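For anyone hitting the same problem, the workaround confirmed above could be wrapped roughly as follows. This is only a sketch: visibleFromAppModule and ensureGLSymbolsVisible are hypothetical helper names, and "libGL.so" is the library name used in this thread (some systems may only ship a versioned name). It checks first and only calls dlopen when the symbols are missing, since opening libGL.so unconditionally was reported earlier in the thread to break on some Intel machines.

```cpp
#include <dlfcn.h>
#include <cassert>
#include <cstdio>

// True if `symbol` can be found by searching the main program's symbol scope,
// which is where GLUpload is described as looking.
bool visibleFromAppModule(const char* symbol) {
    assert(symbol != nullptr);
    void* self = dlopen(nullptr, RTLD_LAZY);
    if (self == nullptr) return false;
    const bool found = dlsym(self, symbol) != nullptr;
    dlclose(self);  // only drops the reference taken above
    return found;
}

// Hypothetical helper: make GL symbols globally visible if they are not already.
// Call it after context creation and before the first GLUpload call.
bool ensureGLSymbolsVisible(const char* library = "libGL.so") {
    if (visibleFromAppModule("glBindTexture")) return true;  // nothing to do
    // RTLD_GLOBAL adds the library's symbols to the global scope. The handle is
    // deliberately never closed so the symbols stay available.
    void* gl = dlopen(library, RTLD_LAZY | RTLD_GLOBAL);
    if (gl == nullptr) {
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", library, dlerror());
        return false;
    }
    return visibleFromAppModule("glBindTexture");
}
```

On drivers where the symbols are already visible the helper returns without loading anything, so the same call can stay in the code path for Intel, nouveau and NVIDIA.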
Passing GetProcAddress similar to gladLoadGL / gl3wInit2 would be ideal for me.
I'm not surprised that your ubuntu test worked, because it's only the proprietary nvidia drivers having issues. The open source nouveau ones work on the same hardware.
Thanks for the report. Please try creating a Wayland window instead of X11, without the dlopen, and let me know if it works or not on the NVIDIA driver.
Wayland and XWayland are the same as they are under xorg. They fail unless I call dlopen.
Basically I used glfwInitHint(GLFW_PLATFORM, GLFW_PLATFORM_WAYLAND) before init, and called glfwGetWaylandWindow to check that I have a native wayland window.
Thanks @viseztrance.