EGL-Registry

EXT_device_enumeration: unclear how hot unplug works

Open emersion opened this issue 4 years ago • 8 comments

How should a driver handle device unplug when it supports EXT_device_enumeration?

  • Should the EGLDevice handles of the unplugged device remain valid?
  • When is it safe for a driver to invalidate an EGLDevice handle (and release associated resources)?
  • What happens when trying to use the unplugged EGLDevice?

This issue stems from https://gitlab.freedesktop.org/glvnd/libglvnd/-/issues/215. We need to be careful to pick behavior that is compatible with a vendor-neutral EGL loader such as glvnd.

For instance, it's not safe for a driver to invalidate an unplugged EGLDevice after the EGL client calls eglQueryDevicesEXT, because that would allow another driver to re-use the exact same handle and prevent the EGL client from figuring out the EGLDevice is gone or has changed. Or maybe glvnd should wrap vendor EGLDevices to prevent this?

cc @cubanismo @kbrenneman

emersion, Jan 21 '21 21:01

@jjulianoatnv may be interested here as well, as similar questions have been raised regarding VkPhysicalDevice objects in Vulkan, and we should probably resolve both issues in a compatible way.

cubanismo, Jan 21 '21 21:01

Another way of looking at the problem is:

  • If an application is using a device, how does it know that the device is no longer valid?
  • How does an application know if a new device is available?
  • How should an application recover from a device becoming invalid?

You could invalidate an EGLDeviceEXT handle, as long as you never re-use that handle for a different device. If you allow re-using a handle, then applications would have to deal with a valid EGLDeviceEXT handle suddenly pointing to a different device, possibly between successive EGL calls.

For functions which take an EGLDeviceEXT handle, just returning an error code (probably EGL_BAD_DEVICE_EXT) could make sense. But, I don't know what should happen with an EGLDisplay that uses that device, especially if that EGLDisplay owns the current context.

kbrenneman, Feb 08 '21 23:02

If an application is using a device, how does it know that the device is no longer valid?

After enumerating the list of devices, if an old device isn't advertised anymore, the old device is no longer valid.
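As a rough illustration of that check, the sketch below compares two enumeration results, such as the arrays filled in by two successive eglQueryDevicesEXT calls. The helper names are hypothetical, and the local typedef only stands in for the one in eglext.h so the sketch compiles without EGL headers. The comparison is only meaningful if a driver never hands out an old handle value for a different device, and it only describes the state at the time of the second enumeration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the eglext.h typedef (an opaque pointer), so that this
 * sketch builds without the EGL headers. */
typedef void *EGLDeviceEXT;

/* Returns true if `dev` is one of the `count` handles in `list`. */
bool device_listed(EGLDeviceEXT dev, const EGLDeviceEXT *list, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (list[i] == dev)
            return true;
    }
    return false;
}

/* Copies every handle that is in `before` but not in `after` into `gone`.
 * `gone` must have room for `n_before` entries.
 * Returns the number of handles written to `gone`. */
size_t find_removed_devices(const EGLDeviceEXT *before, size_t n_before,
                            const EGLDeviceEXT *after, size_t n_after,
                            EGLDeviceEXT *gone)
{
    size_t n_gone = 0;
    for (size_t i = 0; i < n_before; i++) {
        if (!device_listed(before[i], after, n_after))
            gone[n_gone++] = before[i];
    }
    return n_gone;
}
```

An application would keep the `before` array from its previous enumeration and treat every handle reported in `gone` as no longer usable.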

How does an application know if a new device is available?

Out-of-scope. On Linux, udev can be used to monitor new devices appearing/disappearing.

How should an application recover from a device becoming invalid?
But, I don't know what should happen with an EGLDisplay that uses that device, especially if that EGLDisplay owns the current context.

It doesn't seem like this is related to EGLDevice.

These are relevant I think:

  • https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug
  • https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_robustness.txt

emersion, Feb 09 '21 14:02

If an application is using a device, how does it know that the device is no longer valid?

After enumerating the list of devices, if an old device isn't advertised anymore, the old device is no longer valid.

We'd still need to define what happens with EGL functions if a device was removed after the last call to eglQueryDevicesEXT.

We also don't want applications to be constantly polling eglQueryDevicesEXT, and it wouldn't be sufficient even if they did, since an unplug is still asynchronous. So, there needs to be some way for an application to know that it needs to call eglQueryDevicesEXT.

How does an application know if a new device is available?

Out-of-scope. On Linux, udev can be used to monitor new devices appearing/disappearing.

That's fair. Adding a new device doesn't affect what an application is doing with an existing device, so some asynchronous notification should be fine, and that notification is necessarily OS-specific.

How should an application recover from a device becoming invalid? But, I don't know what should happen with an EGLDisplay that uses that device, especially if that EGLDisplay owns the current context.

It doesn't seem like this is related to EGLDevice.

It's part of the same problem -- an EGLDisplay is still a top-level EGL object associated with a device. If that device becomes unusable, then the EGLDisplay also does, just like an EGLDeviceEXT.

kbrenneman, Feb 11 '21 16:02

We'd still need to define what happens with EGL functions if a device was removed after the last call to eglQueryDevicesEXT.

EGL_BAD_DEVICE_EXT for device functions, robustness for EGLDisplay.

there needs to be some way for an application to know that it needs to call eglQueryDevicesEXT.

Yeah, but I don't think EGL should be involved. Clients will likely want to integrate hotplug detection into their event loop, and that's too system-specific for EGL to handle. Just ask users to use platform-specific APIs such as udev.

If you really think EGL should be involved, I'd suggest working on a separate extension, and not block this issue because of it.

It's part of the same problem -- an EGLDisplay is still a top-level EGL object associated with a device. If that device becomes unusable, then the EGLDisplay also does, just like an EGLDeviceEXT.

Yes, but it's not specific to the device platform. It may happen when the EGLDisplay was created for another platform, like X11/GBM/Wayland/surfaceless. The driver will pick a physical device under-the-hood, which may disappear without the client noticing.

emersion, Feb 15 '21 09:02

For adding a new device, we don't need EGL to notify the application. An application can carry on using whatever device it was using and still work just fine. If an application cares about new devices, then it can implement whatever OS-specific hotplug detection it needs, and then call eglQueryDevicesEXT again as necessary. I'll need to update libglvnd to deal with added devices, of course, but that'll be easy enough.

To deal with removing devices, I think we are going to want a new extension. If nothing else, an extension would be a good way to define the behavior that an application should expect. To provide a clean way for applications to cope with device removal, an extension could also define a new error code and/or a new query to distinguish between "that device handle is invalid" and "that device handle is valid, but the device disappeared when you weren't looking."
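One way to picture the distinction between the two cases is a driver-side table that never forgets a handle it has issued. The following is a toy model, not code from any driver: every type and function name is hypothetical, and the status values merely stand for EGL_BAD_DEVICE_EXT and for the new "device lost" error that such an extension might define.

```c
#include <stddef.h>

typedef enum {
    DEVICE_OK,      /* handle was issued and the device is present */
    DEVICE_INVALID, /* handle was never issued (EGL_BAD_DEVICE_EXT case) */
    DEVICE_LOST     /* handle was issued, but the device has been removed */
} device_status;

typedef struct {
    int present; /* nonzero while the hardware is attached */
} device_record;

#define MAX_DEVICES 8

/* Records are never freed, so a handle to a removed device can still be
 * told apart from a pointer that was never a handle at all. */
typedef struct {
    device_record records[MAX_DEVICES];
    size_t issued; /* number of handles handed out so far */
} device_table;

/* Issues a handle for a newly detected device, or NULL if the table is full. */
device_record *table_add_device(device_table *t)
{
    if (t->issued == MAX_DEVICES)
        return NULL;
    device_record *r = &t->records[t->issued++];
    r->present = 1;
    return r;
}

/* Marks the device behind an issued handle as unplugged. */
void table_mark_removed(device_record *r)
{
    r->present = 0;
}

/* Classifies an arbitrary pointer passed in by the application. */
device_status table_check(const device_table *t, const void *handle)
{
    for (size_t i = 0; i < t->issued; i++) {
        if ((const void *)&t->records[i] == handle)
            return t->records[i].present ? DEVICE_OK : DEVICE_LOST;
    }
    return DEVICE_INVALID;
}
```

The cost of this design is that a record stays allocated for as long as the table lives, which is what makes the "lost" answer possible in the first place.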

Anyway, this is what I was thinking for removal behavior. It's a rough sketch right now, but I can write this up as a more formal spec if it sounds reasonable.

For EGLDeviceEXT handles:

  • After a device is removed, all EGL functions which take that EGLDeviceEXT handle will fail. With an extension, we can define a new EGL_DEVICE_LOST error code to use for this.
  • An application can check if a device is removed by calling eglQueryDevicesEXT. With an extension, we can also define a new query attribute for eglQueryDeviceAttribEXT, especially since eglQueryDevicesEXT can be pretty expensive.
  • A driver can re-use the same EGLDeviceEXT handle if the same device is later reconnected. A driver may not re-use the same handle for a different device.

For EGLDisplays:

  • If the device for a display is removed, then the EGLDisplay becomes invalid. The handle still exists, since you can't destroy EGLDisplay handles. All EGL functions which take that EGLDisplay will fail. Use the same error code that we'd use for device functions above.
  • With a new extension, an application could also check for a lost device using eglQueryDisplayAttribEXT.
  • If there's a current context from that display, then the context has a graphics reset, as per GL_KHR_robustness.
  • After an EGLDisplay becomes invalid, the only thing that an application can do with that display is to release any current contexts, and then to tear down the display with eglTerminate.
  • After calling eglTerminate, an application may try to reinitialize the display using eglInitialize. Depending on the display attributes, the driver may allow eglInitialize to succeed by picking a different device.
  • If eglInitialize fails, then an application could try getting a new display handle with eglGetPlatformDisplay. In practice, you'd probably just want to call eglGetPlatformDisplay unconditionally, since if the driver is capable of reinitializing the display, then it can just hand back that same EGLDisplay handle.

That last point is important for libglvnd. When an application calls eglGetPlatformDisplay, more than one vendor might be able to work with the native display, and libglvnd will just pick the first vendor that returns a non-NULL EGLDisplay handle. If the device behind that display disappears, then that first vendor might not have another device it can use, in which case eglInitialize will fail. But, when the application calls eglGetPlatformDisplay, then a different vendor could pick up the display instead, in which case it would return a new EGLDisplay handle.
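The vendor selection described in that paragraph can be modelled in a few lines. This is a simplified illustration, not libglvnd code: the `vendor` struct, its `usable_devices` field, and `pick_vendor` are all made up for the example, and a real loader would call into each vendor library rather than read a counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for a vendor library known to the loader. */
typedef struct {
    const char *name;
    int usable_devices; /* devices that can currently drive the native display */
} vendor;

/* Picks the first vendor that could hand back a non-NULL display,
 * i.e. the first one that still has a usable device.
 * Returns its index, or -1 if no vendor can serve the display. */
int pick_vendor(const vendor *vendors, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (vendors[i].usable_devices > 0)
            return (int)i;
    }
    return -1;
}
```

With two vendors that each have one device, the first one wins. Once its device is unplugged, calling `pick_vendor` again selects the second vendor, which corresponds to a fresh eglGetPlatformDisplay call returning a different EGLDisplay handle.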

kbrenneman, Feb 17 '21 18:02

After a device is removed, all EGL functions which take that EGLDeviceEXT handle will fail. With an extension, we can define a new EGL_DEVICE_LOST error code to use for this.

Looks like a good idea.

With an extension, we can also define a new query attribute for eglQueryDeviceAttribEXT, especially since eglQueryDevicesEXT can be pretty expensive.

Hm. Something like an EGL_DEVICE_IS_ALIVE attrib? I'm not sure this is really needed (e.g. Vulkan doesn't have it).

A driver can re-use the same EGLDeviceEXT handle if the same device is later reconnected.

What if the same device is re-connected but on a different port?

Re-using handles between hotplugs sounds like a footgun TBH. What's the motivation?

emersion, Feb 23 '21 15:02

With an extension, we can also define a new query attribute for eglQueryDeviceAttribEXT, especially since eglQueryDevicesEXT can be pretty expensive.

Hm. Something like an EGL_DEVICE_IS_ALIVE attrib? I'm not sure this is really needed (e.g. Vulkan doesn't have it).

Something like that, yeah.

It wouldn't strictly be needed, since you could just call eglQueryDevicesEXT and look for the handle. Using a specific device query might be easier or faster, though.

A driver can re-use the same EGLDeviceEXT handle if the same device is later reconnected.

What if the same device is re-connected but on a different port?

Re-using handles between hotplugs sounds like a footgun TBH. What's the motivation?

Emphasis on "can" -- a driver would be allowed, not required, to re-use a handle for the same device.

The inverse is the critical part: A driver must not re-use the same handle for a different device, because if it did, then an application would suddenly be working with a different device between two EGL calls without any way to realize that anything changed.

If a driver re-uses a handle for the same device, then you're fine: An application using that handle would continue to use the same device just like it expects.
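A small model of that rule, under the assumption that the driver has some stable identity for a device that does not depend on the port (a UUID or serial number, say). How a driver would actually decide that two devices are "the same" is not settled in this thread, and all names below are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SLOTS 16
#define ID_LEN 40

/* One slot per device identity ever seen; slots are never recycled. */
typedef struct {
    char id[ID_LEN]; /* assumed stable identity, e.g. a UUID string */
    int connected;
} slot;

typedef struct {
    slot slots[MAX_SLOTS];
    size_t used;
} handle_table;

/* Called when a device appears. Returns the handle for it:
 * the old slot if this identity was seen before (allowed re-use),
 * otherwise a slot that has never been handed out.
 * Returns NULL if the table is full or the id is too long. */
slot *on_connect(handle_table *t, const char *id)
{
    for (size_t i = 0; i < t->used; i++) {
        if (strcmp(t->slots[i].id, id) == 0) {
            t->slots[i].connected = 1;
            return &t->slots[i];
        }
    }
    if (t->used == MAX_SLOTS || strlen(id) >= ID_LEN)
        return NULL;
    slot *s = &t->slots[t->used++];
    strcpy(s->id, id);
    s->connected = 1;
    return s;
}

/* Called when a device disappears. The slot stays reserved for its id. */
void on_disconnect(slot *s)
{
    s->connected = 0;
}
```

Because a slot is only ever matched by identity, unplugging one device and plugging in a different one can never produce the old handle value, while re-plugging the original device may.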

kbrenneman, Feb 23 '21 17:02