Critical: modesetting driver refuses to give way to another driver and crashes

Open ONykyf opened this issue 4 weeks ago • 9 comments

Select the version

Git master branch

Describe your issue

On master, the modesetting driver tries to initialize its own Screen 1 (modeset(1)) alongside Screen 0 of the preferred driver (NVIDIA(0) or RADEON(0)). On xlibre-xserver-25.0.0.16 (and earlier versions), the server eventually logs

(II) UnloadModule: "modesetting"
(II) Unloading modesetting
(II) UnloadModule: "vesa"
(II) Unloading vesa

and everything works for both NVidia and Radeon. On master, however, modeset(1) carries on initializing. With NVidia, it just gets to

(II) modeset(1): Damage tracking initialized
(II) modeset(1): Setting screen physical size to 508 x 285
(II) Screen(s) initialized

and then to

(II) Input(s) initialized
(II) modeset(1): Disabling kernel dirty updates, not required.
(EE) modeset(1): failed to set mode: Invalid argument

which is the last message from this screen: not good, but harmless. With Radeon, however, I get

(EE) modeset(1): drmSetMaster failed: Device or resource busy

Fatal server error:
AddScreen/ScreenInit failed for driver 1

https://github.com/X11Libre/xserver/pull/1479 has been applied to both master and xlibre-xserver-25.0.0.16 to avoid problems with registering private keys; without it, crashes would appear even earlier.

I attach logs with the timestamps stripped to make diffing easier.

Steps to reproduce

  1. Start X on Radeon HD 5700
  2. Observe that the X server crashes

What did you expect?

The modesetting driver should give way to other drivers gracefully.

Additional Information

X logs:

xlibre-server-25.0.0.16 on NVidia GTX 760

Xorg.0.log-nvidia470-16-cleared.txt

xlibre-server-master on NVidia GTX 760

Xorg.0.log-nvidia470-master-cleared.txt

xlibre-server-25.0.0.16 on Radeon HD 5700

Xorg.0.log-radeon-16-cleared.txt

xlibre-server-master on Radeon HD 5700

Xorg.0.log-radeon-master-cleared.txt

ONykyf avatar Nov 30 '25 10:11 ONykyf

@ONykyf @cepelinas9000 Sounds like we're not dropping drm master if modesetting fails to initialize the screen.

Is this a problem only here, or does this break on Xorg too?

stefan11111 avatar Dec 01 '25 19:12 stefan11111

@stefan11111 @metux @cepelinas9000 No, I have found the cause: the removal of {fb,sbus,platform,pci}SlotClaimed. As a result, a slot that has already been claimed is initialized again, which is clearly incorrect.

Why have these safety flags been removed?

ONykyf avatar Dec 02 '25 09:12 ONykyf

They break in other cases.

See https://github.com/X11Libre/xserver/pull/998 and https://github.com/X11Libre/xserver/pull/1023

These checks would, for example, prevent the server from initializing with 2 slots of the same type, even if they are on different cards. Worse, if one driver manages to claim a slot, but later fails to initialize the screen, another driver can't claim a slot of the same type.

Edit: Or maybe not. If all the drivers claim pci slots only, then even if https://github.com/X11Libre/xserver/pull/998 is reverted, pciSlotClaimed wouldn't prevent another slot from being claimed.

Sounds like we have to find a way to tell the X server that the screen we're trying to initialize already has a driver, and to not try to load another driver. @metux Any ideas on how to solve this?

stefan11111 avatar Dec 02 '25 10:12 stefan11111

At least reverting https://github.com/X11Libre/xserver/commit/d09b3dae3e91bcc5c465c827950bcd40c1988093, https://github.com/X11Libre/xserver/commit/22d963bc4dc6d8ac810f07cd093bb440a937badf, and https://github.com/X11Libre/xserver/commit/d3fd8c385ba998f5abc333e1a16bb56388eccaae fixes the present issue, though it probably reintroduces other ones. I tested this on NVidia and Radeon.

Anyway, the quick and radical approach failed, so something else needs to be worked out. Using counters like pciSlotClaimed as guards is ugly and unreliable; the guard must be something more closely bound to specific devices.
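
To illustrate what a guard "bound to specific devices" could look like, here is a minimal C sketch. It is not xserver code and not the approach of any PR mentioned in this thread: `claim_device`, `release_device`, the `claim` table, and the bus ID strings are all hypothetical names invented for the example. The idea is only that a claim is recorded per device, so a second driver is refused on a device that already has an owner, other devices stay claimable, and a driver that fails to initialize can give its claim back.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CLAIMS 8

/* Hypothetical record of which driver owns which device. */
struct claim {
    char bus_id[32];   /* assumed identifier format, e.g. "PCI:1:0:0" */
    char driver[32];   /* e.g. "radeon" */
    int  in_use;
};

static struct claim claims[MAX_CLAIMS];

/* Return the active claim for bus_id, or NULL if the device is unclaimed. */
static struct claim *find_claim(const char *bus_id)
{
    for (int i = 0; i < MAX_CLAIMS; i++)
        if (claims[i].in_use && strcmp(claims[i].bus_id, bus_id) == 0)
            return &claims[i];
    return NULL;
}

/* Claim a device for a driver.
 * Returns 1 on success, 0 if the device is already claimed or the table is full. */
static int claim_device(const char *bus_id, const char *driver)
{
    assert(bus_id != NULL && driver != NULL);

    if (find_claim(bus_id))
        return 0;                       /* this device already has an owner */

    for (int i = 0; i < MAX_CLAIMS; i++) {
        if (!claims[i].in_use) {
            snprintf(claims[i].bus_id, sizeof claims[i].bus_id, "%s", bus_id);
            snprintf(claims[i].driver, sizeof claims[i].driver, "%s", driver);
            claims[i].in_use = 1;
            return 1;
        }
    }
    return 0;                           /* no free table entry */
}

/* Give a claim back, e.g. after the owning driver failed to initialize its screen. */
static void release_device(const char *bus_id)
{
    struct claim *c = find_claim(bus_id);
    if (c)
        c->in_use = 0;
}
```

With this table, a second `claim_device` call for the same bus ID is refused while a call for a different bus ID succeeds, and after `release_device` the device can be claimed by another driver. Unlike a single global counter, the guard is tied to the device and not to the bus type.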

ONykyf avatar Dec 02 '25 11:12 ONykyf

Are you sure all 3 reverts are needed? This one, at least, only kills the server: 22d963bc4dc6d8ac810f07cd093bb440a937badf

stefan11111 avatar Dec 02 '25 14:12 stefan11111

If two cards with the same driver are present, maybe, but I have no such collision.

I tried without it, and it fails. I have left the three commits reverted as a temporary fix (just to make master work) and am trying to come up with something decent for this purpose.

Edit: Oh, no, sorry, it was https://github.com/X11Libre/xserver/commit/d3fd8c385ba998f5abc333e1a16bb56388eccaae, not https://github.com/X11Libre/xserver/commit/22d963bc4dc6d8ac810f07cd093bb440a937badf; the latter can safely be dropped.

ONykyf avatar Dec 02 '25 14:12 ONykyf

In that case, can you make a PR that reverts both commits, in case we can't come up with something better in time? Two commits, one for each revert, with a big asterisk, both in the commit message and in the code, saying that this is a hack and should be fixed?

stefan11111 avatar Dec 02 '25 20:12 stefan11111

I wonder how these reverts help, though. Looking at the code, all they do is, if a non-fb slot is claimed on any card, do not allow an fb slot to be claimed.

What is claiming an fb slot in that log?
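
The behaviour described above can be modelled in a few lines of C. This is only a sketch of the description in this comment, not the actual xserver code: `non_fb_slot_claimed`, `claim_non_fb_slot`, and `claim_fb_slot` are hypothetical names, and the `card` parameter exists purely to show that the model ignores it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical server-wide guard flag; it is global, not per card. */
static bool non_fb_slot_claimed = false;

/* Claim a non-fb (e.g. PCI) slot on the given card.
 * In this model the claim always succeeds and is remembered globally. */
static bool claim_non_fb_slot(int card)
{
    assert(card >= 0);
    non_fb_slot_claimed = true;
    return true;
}

/* Claim an fb slot on the given card.
 * Refused as soon as any non-fb slot has been claimed on any card. */
static bool claim_fb_slot(int card)
{
    assert(card >= 0);
    return !non_fb_slot_claimed;
}
```

In this model, once `claim_non_fb_slot(0)` has run, `claim_fb_slot(1)` is refused even though card 1 is a different card, and nothing else is affected. That is why it is not obvious from the guard alone how the reverts would stop modesetting unless something is trying to claim an fb slot.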

stefan11111 avatar Dec 02 '25 20:12 stefan11111

I also wonder how you got nvidia to take priority over modesetting:

(==) Matched nvidia as autoconfigured driver 0
(==) Matched nouveau as autoconfigured driver 1
(==) Matched modesetting as autoconfigured driver 2

It should be nouveau, then modesetting, and then nvidia: https://github.com/X11Libre/xserver/pull/1231

stefan11111 avatar Dec 02 '25 22:12 stefan11111

It is exactly what 10-nvidia.conf is for: if nvidia_drm drives a card, nvidia becomes a preferred driver and ModulePath is enhanced with an nvidia dir before other dirs.
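
For illustration, an OutputClass snippet of this kind might look roughly like the following. This is a sketch and not the actual contents of 10-nvidia.conf: the identifier and both ModulePath directories are assumptions and differ between distributions.

```
# Hypothetical example; the paths below are assumptions, adjust them to your system.
Section "OutputClass"
    Identifier  "nvidia"
    # Applies only when the nvidia-drm kernel driver drives the card
    MatchDriver "nvidia-drm"
    # Prefer the proprietary X driver for such cards
    Driver      "nvidia"
    # Search the nvidia module directory before the default one
    ModulePath  "/usr/lib/nvidia/xorg"
    ModulePath  "/usr/lib/xorg/modules"
EndSection
```

The MatchDriver line restricts the section to cards driven by nvidia-drm, so on other hardware the file has no effect.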

ONykyf avatar Dec 03 '25 19:12 ONykyf

I wonder how these reverts help, though. Looking at the code, all they do is, if a non-fb slot is claimed on any card, do not allow an fb slot to be claimed.

What is claiming an fb slot in that log?

@stefan11111 @metux No need for this: I made a working prototype, just a demo, that solves the issue without reverting: https://github.com/X11Libre/xserver/pull/1564

ONykyf avatar Dec 03 '25 19:12 ONykyf

It is exactly what 10-nvidia.conf is for: if nvidia_drm drives a card, nvidia becomes a preferred driver and ModulePath is enhanced with an nvidia dir before other dirs.

Is this done for all cards, or only for old cards that don't support egl with gbm, and hence don't have glamor acceleration for the modesetting driver?

stefan11111 avatar Dec 03 '25 20:12 stefan11111

By default the file is installed, but anyone who does not want to use the legacy nvidia drivers can simply remove the file or change a line in it, and it gets disabled.

I should document this one day.

ONykyf avatar Dec 03 '25 21:12 ONykyf