
Failed dkms status or autoinstall returns code 0 instead of an error one

Open C0rn3j opened this issue 8 months ago • 9 comments

Both of these commands return code 0. They should return a non-zero code, since both have errored.

# dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.

# dkms autoinstall
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.
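
For anyone checking this on their own machine, a small helper along these lines makes the exit status explicit. This is only a sketch: `show_rc` is a hypothetical name, not a dkms command.

```shell
#!/bin/sh
# Hypothetical helper (not part of dkms): run a command, print its exit
# status, and pass that status on to the caller.
show_rc() {
    "$@"
    rc=$?
    echo "exit status: $rc"
    return "$rc"
}

# Example usage on a machine with the broken module tree:
#   show_rc dkms status
#   show_rc dkms autoinstall
```

With the behaviour reported here, both calls would print an error followed by `exit status: 0`.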

This is on Arch Linux with dkms 3.0.11.

I skimmed the changelog for the latest release, 3.0.12, which I have not tested, but it does not look like this issue was fixed there.

C0rn3j avatar Oct 18 '23 12:10 C0rn3j

Hello fellow Arch user. Can you share an idiot-proof, step-by-step reproducer?

Yes, I don't think we fixed anything like that with 3.0.12.

evelikov avatar Oct 19 '23 13:10 evelikov

I can't reproduce it with a fake module (same error, but return code 4), so I presume one condition is that the module already has to be installed, or some other weird stuff is going on.

I can reproduce it by breaking an existing nvidia module, pointing its source symlink at /dev/null:

[0] % cd /var/lib/dkms/nvidia/535.113.01

[0] % sudo rm -f source; sudo ln -sf /usr/src/nvidia-535.113.01 source

[0] % dkms status     
nvidia/535.113.01, 6.1.58-1-lts, x86_64: installed
nvidia/535.113.01, 6.5.7-arch1-1, x86_64: installed

[0] % sudo rm -f source; sudo ln -sf /dev/null source                 

[0] % dkms status                                    
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.113.01/source/dkms.conf does not exist.

[0] % 

C0rn3j avatar Oct 19 '23 13:10 C0rn3j

The reproducer works for me. The error seems to be coming from the read_conf in module_status_built_extra().

All the other instances across the codebase are read_conf_or_die and a lot of that code is over 10 years old. Off the top of my head, I cannot see a reason why we couldn't flip the final user to "_or_die" variant.
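
To illustrate the difference being discussed, here is a rough sketch. It is not the dkms source: `load_conf`, `load_conf_or_die`, `status_lenient` and `status_strict` are hypothetical stand-ins, and exit code 4 is chosen only because that is the code mentioned earlier in this thread.

```shell
#!/bin/sh
# Hypothetical stand-ins, NOT the dkms implementation.
# load_conf reports failure only through its return value.
load_conf() {
    conf=$1
    if [ ! -f "$conf" ]; then
        echo "Error! Could not locate dkms.conf file." >&2
        echo "File: $conf does not exist." >&2
        return 1
    fi
    return 0
}

# The "_or_die" variant turns that failure into a fatal exit.
load_conf_or_die() {
    load_conf "$1" || exit 4
}

# A caller that ignores the return value carries on, so the overall
# status is that of the last command (here: echo, i.e. 0).
status_lenient() {
    load_conf "$1"
    echo "done"
}

# The strict caller stops at the first missing file with a non-zero status.
status_strict() {
    load_conf_or_die "$1"
    echo "done"
}
```

Under these assumptions the lenient caller prints the error and still exits 0, which matches the symptom reported in this issue.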

@anbe42 IIRC you recently silenced dkms status so it doesn't show deprecation warnings - aka the read_conf that I'm thinking of, plus you did quite a lot of work around autoinstall (thanks again).

Do you foresee any issues if we promote the error to being fatal?

evelikov avatar Oct 19 '23 17:10 evelikov

@scaronni if you have any input, that would be highly appreciated as well. Thanks o/

evelikov avatar Oct 19 '23 17:10 evelikov

Thinking about this a little more: autoinstall explicitly aims to soldier on, even when building/installing a specific module fails. So promoting the error to fatal goes in the opposite direction.

On the other hand if dkms.conf is missing then the module is catastrophically broken.
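
One possible middle ground, sketched below purely as an illustration: keep going through every module, but remember that something failed and report it through the exit status at the end. `build_one` and `install_all` are hypothetical names, and `build_one` is a placeholder check, not how dkms builds anything.

```shell
#!/bin/sh
# Hypothetical sketch, NOT dkms code.
# build_one stands in for "build/install one module"; here it merely
# succeeds when the given source directory contains a dkms.conf.
build_one() {
    [ -f "$1/dkms.conf" ]
}

# Try every module, never stop early, but return non-zero if any failed.
install_all() {
    failed=0
    for src in "$@"; do
        if build_one "$src"; then
            echo "ok: $src"
        else
            echo "failed: $src" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

This keeps the "soldier on" behaviour while still giving scripts and hooks a non-zero code to react to.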

@C0rn3j what did you do/what triggered the error on your end - was it manually tinkering around or something OS/packaging that caused it?

evelikov avatar Oct 20 '23 19:10 evelikov

I am not sure yet what triggered it. I just had a bunch of broken dkms builds on two machines for non-existent kernel and driver versions. I suspect some weird race condition prodded on by the kernel-modules-hook package.

C0rn3j avatar Oct 20 '23 21:10 C0rn3j

Looks like we have two things to fix here:

  • recovery from an (externally) broken /var/lib/dkms, aka dkms fsck
  • error propagation in such a case (the bug reported here)

One way this broken state could have happened: some packaging removed /usr/src/$driver-$oldversion upon some upgrade without calling the corresponding dkms remove hook first ... This should not happen with Debian-packaged *-dkms modules, but I don't know what else is out there in the wild ...
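
As a rough idea of what a read-only check for this state could look like, here is a sketch. It assumes the `<tree>/<module>/<version>/source/dkms.conf` layout seen in the error messages above, and it also assumes that per-kernel entries named `kernel-*` sit next to the version directories and should be skipped. `find_broken` is a hypothetical name; nothing like it is claimed to exist in dkms.

```shell
#!/bin/sh
# Hypothetical read-only check, NOT a dkms command.
# Prints every <module>/<version> whose source/dkms.conf is missing and
# returns non-zero if at least one such entry was found.
find_broken() {
    tree=${1:-/var/lib/dkms}
    broken=0
    for dir in "$tree"/*/*; do
        [ -d "$dir" ] || continue
        # Assumption: kernel-* entries are per-kernel links, not versions.
        case ${dir##*/} in
            kernel-*) continue ;;
        esac
        if [ ! -f "$dir/source/dkms.conf" ]; then
            rel=${dir#"$tree"/}
            echo "broken: $rel"
            broken=1
        fi
    done
    return "$broken"
}

# Example usage:
#   find_broken /var/lib/dkms
```

It only reports; any actual recovery would still need a decision on what to do with the leftover entries.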

anbe42 avatar Oct 20 '23 21:10 anbe42

Indeed, splitting this in two makes sense. Recovery would be great, although since the base information (aka dkms.conf) is missing, I don't know what we can do here.

As for the latter point, we already exit in all the other instances of a missing dkms.conf. So it's a case of either making those non-fatal and then fixing the almost-impossible-to-test error paths, or flipping the final one.

Browsing across the Arch packages:

  • kernel-modules-hook - touches only /usr/lib/modules making and restoring backups
  • nvidia-dkms - the one that was likely removed
  • dkms itself has a separate hook/script, which does manual parsing/handling (akin to autoinstall), ensuring depmod is called only once per kernel, even if XXXs dkms modules are added/removed.

AFAICT autoinstall does not exist as far as Arch is concerned, although the extra script does call dkms status.

The pacman hook triggering the script is post transaction for install, and pre transaction for update/remove, so it cannot be the one causing the issue.

Considering there is no obvious way this can happen (in Arch and Debian) outside of user error (it's fine, I'm not trying to blame anyone here), I'm inclined to make it a fatal error. If it turns out there's some valid use-case, we can quickly revert it.

That said, let's leave this issue open for a while and see how things go.

evelikov avatar Oct 21 '23 13:10 evelikov

# 3.0.12
[0] % sudo dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/550.54.14/source/dkms.conf does not exist.
# 3.0.13
[0] % sudo dkms status
nvidia/550.54.14: broken
Error! nvidia/550.54.14: Missing the module source directory or the symbolic link pointing to it.
Manual intervention is required!
nvidia/550.67, 6.6.23-1-lts, x86_64: installed
nvidia/550.67, 6.7.9-arch1-1, x86_64: installed (original_module exists) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!)
nvidia/550.67, 6.8.2-arch2-1, x86_64: installed

Now with the new release, status goes through everything instead of instantly crashing, which will hopefully make this a bit nicer to debug...

Still haven't found out why this happens, but it does keep happening.
I have only just installed .13, so all of this was created with .12:

image

C0rn3j avatar Apr 01 '24 10:04 C0rn3j