
make_default_context() wasn't able to create a context

Open alievk opened this issue 5 years ago • 16 comments

When I run test_driver.py in a Docker container, it fails like this:

ctx_maker = <function make_default_context.<locals>.ctx_maker at 0x7f04bfa5b2f0>

def make_default_context(ctx_maker=None):
    if ctx_maker is None:
        def ctx_maker(dev):
            return dev.make_context()

    ndevices = cuda.Device.count()
    if ndevices == 0:
        raise RuntimeError("No CUDA enabled device found. "
                "Please check your installation.")

    # Is CUDA_DEVICE set?
    import os
    devn = os.environ.get("CUDA_DEVICE")

    # Is $HOME/.cuda_device set ?
    if devn is None:
        try:
            homedir = os.environ.get("HOME")
            assert homedir is not None
            devn = (open(os.path.join(homedir, ".cuda_device"))
                    .read().strip())
        except:
            pass

    # If either CUDA_DEVICE or $HOME/.cuda_device is set, try to use it
    if devn is not None:
        try:
            devn = int(devn)
        except TypeError:
            raise TypeError("CUDA device number (CUDA_DEVICE or ~/.cuda_device)"
                    " must be an integer")

        dev = cuda.Device(devn)
        return ctx_maker(dev)

    # Otherwise, try to use any available device
    else:
        for devn in range(ndevices):
            dev = cuda.Device(devn)
            try:
                return ctx_maker(dev)
            except cuda.Error:
                pass

    raise RuntimeError("make_default_context() wasn't able to create a context on any of the %d detected devices" % ndevices)
RuntimeError: make_default_context() wasn't able to create a context on any of the 8 detected devices
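For readers following along: the device-selection order that make_default_context walks through above can be sketched in plain Python. This is a simplified, hypothetical model (pick_device_number is illustrative, not a pycuda function), shown only to make the precedence clear:

```python
def pick_device_number(environ, cuda_device_file_contents=None):
    # Hypothetical sketch of make_default_context's selection order:
    #   1. the CUDA_DEVICE environment variable,
    #   2. the contents of ~/.cuda_device (passed in here for simplicity),
    #   3. otherwise None, meaning "try each detected device in order".
    devn = environ.get("CUDA_DEVICE")
    if devn is None:
        devn = cuda_device_file_contents
    if devn is None:
        return None
    return int(devn.strip())

print(pick_device_number({"CUDA_DEVICE": "3"}))  # 3
print(pick_device_number({}, "1\n"))             # 1
print(pick_device_number({}))                    # None
```

Only when neither source names a device does pycuda fall back to looping over all detected devices, which is why the error above reports trying all 8.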

I use the CUDA + OpenGL images from gitlab.com/nvidia/cudagl, specifically 10.0-devel-ubuntu18.04.

The pycuda sources are from the latest master branch, configured like this:

./configure.py --cuda-root=/usr/local/cuda --cuda-enable-gl --cudadrv-lib-dir=/usr/lib/x86_64-linux-gnu

glxgears works, and CUDA works (tested via PyTorch).

glxinfo output https://pastebin.com/Qp1thpHB

I use Xvfb to fake an X server.

Thanks!

alievk avatar Jul 04 '19 12:07 alievk

PyTorch may have run on an alternate backend. Could you try running one of the CUDA SDK examples to verify? Also, could you copy-paste the output of nvidia-smi?

inducer avatar Jul 04 '19 18:07 inducer

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                  Off |
| N/A   33C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:06:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:07:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:08:00.0 Off |                  Off |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  Off  | 00000000:0C:00.0 Off |                  Off |
| N/A   34C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  Off  | 00000000:0D:00.0 Off |                  Off |
| N/A   33C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  Off  | 00000000:0E:00.0 Off |                  Off |
| N/A   35C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  Off  | 00000000:0F:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I was able to run CUDA samples, particularly this https://github.com/NVIDIA/cuda-samples/tree/master/Samples/EGLStream_CUDA_Interop

The CUDA documentation says that cuGLCtxCreate has been deprecated since CUDA 5, but pycuda refers to it: https://github.com/inducer/pycuda/blob/d6fbd16387fe8792628c04c4f3754b39e24c8317/src/cpp/cuda_gl.hpp#L37 Could that be the problem?

alievk avatar Jul 05 '19 10:07 alievk

Could it be a problem?

Deprecated doesn't mean unsupported. Does the GL context sharing example work for you?

inducer avatar Jul 06 '19 17:07 inducer

Hello,

I think I have a similar issue.

A user is encountering a problem when trying to run https://github.com/koszullab/instaGRAAL/, which relies on pycuda. This user is using a Tesla V100 on a cluster.

This error comes from the following function:

import pycuda.driver as cuda
import pycuda.gl as cudagl

import OpenGL.GL
import OpenGL.GLU
import OpenGL.GLUT

def cuda_gl_init(self):
    cuda.init()
    if bool(OpenGL.GLUT.glutMainLoopEvent):
        id_gpu = self.device
        curr_gpu = cuda.Device(id_gpu)
        logger.info("Selected_device: {}".format(curr_gpu.name()))
        self.ctx_gl = cudagl.make_context(
            curr_gpu, flags=cudagl.graphics_map_flags.NONE
        )
    else:
        import pycuda.gl.autoinit

        curr_gpu = pycuda.gl.autoinit.device
        self.ctx_gl = cudagl.make_context(
            curr_gpu, flags=cudagl.graphics_map_flags.NONE
        )

When running the if case he gets:

INFO :: Selected_device: Tesla V100-SXM2-16GB
Traceback (most recent call last):
  File "/usr/local/bin/instagraal", line 9, in <module>
    load_entry_point('instagraal==0.1.6', 'console_scripts', 'instagraal')()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 2164, in main
    output_folder=output_folder,
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 208, in __init__
    self.cuda_gl_init()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 1448, in cuda_gl_init
    curr_gpu, flags=cudagl.graphics_map_flags.NONE
pycuda._driver.Error: cuGLCtxCreate failed: unknown error

and when running the else case he gets:

Traceback (most recent call last):
  File "/usr/local/bin/instagraal", line 9, in <module>
    load_entry_point('instagraal==0.1.6', 'console_scripts', 'instagraal')()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 2164, in main
    output_folder=output_folder,
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 208, in __init__
    self.cuda_gl_init()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 1451, in cuda_gl_init
    import pycuda.gl.autoinit
  File "/usr/local/lib/python3.6/dist-packages/pycuda-2019.1.2-py3.6-linux-x86_64.egg/pycuda/gl/autoinit.py", line 9, in <module>
    context = make_default_context(lambda dev: cudagl.make_context(dev))
  File "/usr/local/lib/python3.6/dist-packages/pycuda-2019.1.2-py3.6-linux-x86_64.egg/pycuda/tools.py", line 205, in make_default_context
    "on any of the %d detected devices" % ndevices)
RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices

I believe this user has NVIDIA drivers 418.67 and CUDA 10.1. I personally run this program on a desktop with an RTX 2080 Ti, NVIDIA drivers 440.59, and CUDA 10.2, and I have not encountered this issue.

Do you have any idea where the problem may come from?

nadegeguiglielmoni avatar May 19 '20 13:05 nadegeguiglielmoni

Tesla V100 on a cluster

Is their X server using the Nv driver? Do they even have an X server running? I would ask them to run one of the Nvidia GL interop examples directly. If those work, come back, and we can investigate further. Maybe encourage them to participate here directly.

inducer avatar May 19 '20 20:05 inducer

Hi @inducer, I've met the same problem @nadegeguiglielmoni mentioned above. I have the GPU on a cluster, and nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   31C    P0    37W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When I run instaGRAAL alone for its help information, it works fine. But if I run it with my real data, it throws an error, saying:

INFO :: Selected_device: Tesla V100-SXM2-16GB
Traceback (most recent call last):
  File "/usr/local/bin/instagraal", line 9, in <module>
    load_entry_point('instagraal==0.1.6', 'console_scripts', 'instagraal')()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 2164, in main
    output_folder=output_folder,
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 208, in __init__
    self.cuda_gl_init()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 1448, in cuda_gl_init
    curr_gpu, flags=cudagl.graphics_map_flags.NONE
pycuda._driver.Error: cuGLCtxCreate failed: unknown error

Following @nadegeguiglielmoni's direction, I replaced the line if bool(OpenGL.GLUT.glutMainLoopEvent): mentioned above with if False:, and it throws this error:

Traceback (most recent call last):
  File "/usr/local/bin/instagraal", line 9, in <module>
    load_entry_point('instagraal==0.1.6', 'console_scripts', 'instagraal')()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 2164, in main
    output_folder=output_folder,
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 208, in __init__
    self.cuda_gl_init()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 1451, in cuda_gl_init
    import pycuda.gl.autoinit
  File "/usr/local/lib/python3.6/dist-packages/pycuda-2019.1.2-py3.6-linux-x86_64.egg/pycuda/gl/autoinit.py", line 9, in <module>
    context = make_default_context(lambda dev: cudagl.make_context(dev))
  File "/usr/local/lib/python3.6/dist-packages/pycuda-2019.1.2-py3.6-linux-x86_64.egg/pycuda/tools.py", line 205, in make_default_context
    "on any of the %d detected devices" % ndevices)
RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices

After that, I added a cuda.Context.pop() call on the line above if False:, and it shows another error:

Traceback (most recent call last):
  File "/usr/local/bin/instagraal", line 9, in <module>
    load_entry_point('instagraal==0.1.6', 'console_scripts', 'instagraal')()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 2165, in main
    output_folder=output_folder,
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 208, in __init__
    self.cuda_gl_init()
  File "/usr/local/lib/python3.6/dist-packages/instagraal-0.1.6-py3.6.egg/instagraal/instagraal.py", line 1443, in cuda_gl_init
    cuda.Context.pop()
pycuda._driver.LogicError: context::pop failed: invalid device context - cannot pop non-current context

I don't quite understand what the X server means, and when I use top to see what is running, it shows this:

Tasks:  31 total,   1 running,  30 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.9 us,  0.0 sy,  0.0 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 62914560 total, 61282112 free,   422244 used,  1210204 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 62492316 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
231 root 20 0 119672 13160 6524 S 7.0 0.0 1:42.48 x11vnc
229 root 20 0 1295872 51692 19860 S 6.3 0.1 1:30.51 Xvfb
388 root 20 0 608780 24208 15064 S 1.3 0.0 0:13.09 xfce4-terminal
378 root 20 0 51740 11980 1876 S 1.0 0.0 0:08.63 python
86 www-data 20 0 141444 2876 1012 S 0.7 0.0 0:04.90 nginx
49 sysu_mh+ 20 0 19868 16692 0 S 0.3 0.0 0:24.74 ttyd
70 root 20 0 64468 18164 4024 S 0.3 0.0 0:03.97 supervisord
306 root 20 0 181548 12828 9020 S 0.3 0.0 0:02.66 xfwm4
4785 root 20 0 38704 1764 1304 R 0.3 0.0 0:00.02 top
1 root 20 0 10040 1728 1376 S 0.0 0.0 0:00.06 start-ubuntu180
38 root 20 0 19868 16692 0 S 0.0 0.0 0:24.10 ttyd
50 root 20 0 4512 384 320 S 0.0 0.0 0:00.29 tini
67 root 20 0 72284 1180 432 S 0.0 0.0 0:00.00 sshd
73 root 20 0 141108 6148 4624 S 0.0 0.0 0:00.02 nginx
74 root 20 0 155852 21496 5132 S 0.0 0.0 0:03.34 python
154 root 20 0 11292 316 0 S 0.0 0.0 0:00.00 ssh-agent
230 root 20 0 4616 664 564 S 0.0 0.0 0:00.00 sh
232 root 20 0 10036 1748 1372 S 0.0 0.0 0:00.02 bash
269 root 20 0 47208 11188 3524 S 0.0 0.0 0:01.17 python
275 root 20 0 246764 7844 6112 S 0.0 0.0 0:00.07 xfce4-session
296 root 20 0 45688 824 412 S 0.0 0.0 0:00.00 dbus-launch
297 root 20 0 47620 1532 1024 S 0.0 0.0 0:00.06 dbus-daemon
300 root 20 0 59344 2976 2344 S 0.0 0.0 0:00.04 xfconfd
304 root 20 0 11292 320 0 S 0.0 0.0 0:00.00 ssh-agent

I hope this helps you find the reason why it fails. The instagraal.py file is attached here:
test6.txt

longzhangnation avatar May 20 '20 11:05 longzhangnation

Hi @inducer, my problem was solved by @nadegeguiglielmoni. Thank you both. Have a nice day.

longzhangnation avatar May 20 '20 15:05 longzhangnation

Out of curiosity, could you share what ultimately solved the problem?

inducer avatar May 20 '20 18:05 inducer

The problem was not really solved; they just switched to a different version where the display is disabled, so it's only a workaround.

nadegeguiglielmoni avatar May 20 '20 19:05 nadegeguiglielmoni

@inducer I think some users of gprMax (https://github.com/gprMax/gprMax) are experiencing something similar with certain GPUs - https://github.com/gprMax/gprMax/issues/270

I have done some very rudimentary checks with those reporting the problem, and it seems their GPU has problems at the point at which make_context() is called. I had them try the following very simple test:

import pycuda.driver as drv
drv.init()
dev = drv.Device(0)
print('Trying to create context....')
ctx = dev.make_context()
print(f'Context created on device: {dev.name()}')
ctx.pop()
del ctx
print('Context removed.\nEnd of test')

Any thoughts on the next steps to try?

craig-warren avatar Nov 19 '20 13:11 craig-warren

You could try the (relatively recent) retain_primary_context to create a context.
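A minimal sketch of what that might look like (assuming pycuda 2020.1 or later, where Device.retain_primary_context is available; the import guard is only there so the snippet degrades gracefully on machines without a working GPU or driver):

```python
try:
    import pycuda.driver as drv
    drv.init()
    have_cuda = True
except Exception:
    have_cuda = False  # no usable CUDA driver/GPU here

if have_cuda:
    dev = drv.Device(0)
    # Retain the primary context (the one the CUDA runtime API uses)
    # instead of creating a fresh context with make_context().
    ctx = dev.retain_primary_context()
    ctx.push()
    print("primary context active on:", dev.name())
    ctx.pop()
    ctx.detach()
else:
    print("CUDA unavailable; skipping")
```

If the primary context can be retained where make_context() fails, that narrows the problem down to context creation rather than the driver as a whole.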

inducer avatar Nov 19 '20 14:11 inducer

Thanks, but it seems to produce the same result, i.e. the test script returns to the command prompt at the retain_primary_context() call.

Stupid question: what is context creation used for? I just searched through the gprMax code I wrote several years ago, and I noticed I create a context but never do anything explicitly with it, apart from popping and deleting it.

craig-warren avatar Nov 19 '20 16:11 craig-warren

  • Do the CUDA SDK samples work on that machine? Particularly the one for the driver SDK?
  • Can you get a backtrace (via gdb) at the moment the program aborts?
  • Almost all CUDA calls (memory alloc, program load, stream creation) require an active context. The currently active context is implicit global state via the context stack, and hence not directly visible in API usage.
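That stack discipline can be mimicked in plain Python (FakeContext is an illustrative stand-in, not a pycuda class); it also shows why popping a non-current context fails, as in the context::pop error earlier in this thread:

```python
class FakeContext:
    # Toy model of CUDA's per-thread context stack: API calls implicitly
    # target whatever context sits on top of the stack.
    _stack = []

    def push(self):
        FakeContext._stack.append(self)

    def pop(self):
        if not FakeContext._stack or FakeContext._stack[-1] is not self:
            raise RuntimeError("cannot pop non-current context")
        FakeContext._stack.pop()

    @classmethod
    def current(cls):
        return cls._stack[-1] if cls._stack else None

a, b = FakeContext(), FakeContext()
a.push()
b.push()
assert FakeContext.current() is b   # calls now target b
b.pop()
assert FakeContext.current() is a   # a is active again
try:
    b.pop()                         # b is no longer current
except RuntimeError as e:
    print(e)                        # cannot pop non-current context
```

The real context stack lives in the driver, per thread, which is why it never shows up explicitly in API calls.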

inducer avatar Nov 19 '20 19:11 inducer

@inducer many thanks for your comments, which I have passed on. It seems the CUDA SDK samples are working OK; see https://github.com/gprMax/gprMax/issues/270#issuecomment-732124541

I'm not sure it is easy for the users to get gdb working on their Windows machines.

craig-warren avatar Nov 24 '20 12:11 craig-warren

Device(0)

I am getting the same context issue and tried this code, but I still get this error: TypeError: pybind11::init(): factory function returned nullptr. Any suggestions, please?

neso613 avatar Jul 09 '21 12:07 neso613

I found this issue occurred at:

"/opt/conda/lib/python3.9/site-packages/pycuda-2022.2.2-py3.9-linux-x86_64.egg/pycuda/tools.py", line 230, in make_default_context

The real exception is:

"cuGLCtxCreate failed: OS call failed or operation not supported on this OS", caused by dev.make_context()

My OS is Ubuntu 20.04 with CUDA 12.

dehengxu avatar Aug 03 '23 04:08 dehengxu