dkms
Failed dkms status or autoinstall returns code 0 instead of an error one
Both of these commands return exit code 0. They should return a non-zero exit code, since they have errored.
# dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.
# dkms autoinstall
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.
This is on Arch Linux with dkms 3.0.11.
I skimmed the changelog for the latest 3.0.12, which I did not test, but it does not look like this issue was fixed there.
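For anyone wanting to confirm the exit status on their own machine, a small helper like the one below can make it visible. It is only a sketch: show_rc is a hypothetical name and not part of dkms; it runs whatever command it is given and reports that command's exit status on stderr.

```shell
# Hypothetical helper (not part of dkms): run a command, let its output
# through unchanged, then report its exit status on stderr.
show_rc() {
    local rc
    "$@"
    rc=$?
    echo "exit status: $rc" >&2
    return "$rc"
}

# Example invocations on an affected system (these need dkms installed):
#   show_rc dkms status
#   show_rc dkms autoinstall
```

With the behaviour reported above, both invocations would print the error message followed by "exit status: 0".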
Hello fellow Arch user. Can you share an idiot-proof, step-by-step reproducer?
Yes, I don't think we fixed anything like that with 3.0.12.
I can't reproduce it with a fake module (same error, but return code 4), so I presume one condition is that a module already has to be installed, or some other weird stuff is going on.
I can reproduce it by breaking an existing nvidia module, pointing its source symlink to /dev/null:
[0] % cd /var/lib/dkms/nvidia/535.113.01
[0] % sudo rm -f source; sudo ln -sf /usr/src/nvidia-535.113.01 source
[0] % dkms status
nvidia/535.113.01, 6.1.58-1-lts, x86_64: installed
nvidia/535.113.01, 6.5.7-arch1-1, x86_64: installed
[0] % sudo rm -f source; sudo ln -sf /dev/null source
[0] % dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.113.01/source/dkms.conf does not exist.
[0] %
The reproducer works for me. The error seems to be coming from the read_conf in module_status_built_extra().
All the other instances across the codebase are read_conf_or_die, and a lot of that code is over 10 years old. Off the top of my head, I cannot see a reason why we couldn't flip the final user to the "_or_die" variant.
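To illustrate the general difference between a returning helper and an "_or_die" helper, here is a minimal sketch. All function names below are hypothetical stand-ins and the exit value 1 is arbitrary; this is not the actual dkms code, and it only shows one generic way a shell script can print an error yet still finish with status 0. Whether that is the exact mechanism inside dkms is not confirmed here.

```shell
# Hypothetical stand-ins for illustration; NOT the actual dkms functions.

# Returning variant: reports the problem and returns non-zero,
# leaving it to the caller to decide what happens next.
load_conf() {
    if [ ! -r "$1" ]; then
        echo "Error! Could not locate dkms.conf file." >&2
        return 1
    fi
}

# Fatal variant: same check, but a failure terminates the script.
load_conf_or_die() {
    load_conf "$1" || exit 1
}

# Caller that drops the result: the error is printed, yet the status is 0.
status_lenient() {
    load_conf "$1"
    return 0
}

# Caller using the fatal variant. The body is a subshell here so that
# "exit" does not terminate the shell that calls this sketch.
status_strict() (
    load_conf_or_die "$1"
)
```

Calling status_lenient with a missing file prints the error and returns 0, while status_strict returns non-zero for the same input.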
@anbe42 IIRC you recently silenced dkms status so it doesn't show deprecation warnings (that's the read_conf I'm thinking of), plus you did quite a lot of work around autoinstall (thanks again).
Do you foresee any issues if we promote the error to being fatal?
@scaronni if you have any input, that would be highly appreciated as well. Thanks o/
Thinking about this a little more: autoinstall explicitly aims to soldier on, even when building/installing a specific module fails. So promoting the error to fatal goes in the opposite direction.
On the other hand if dkms.conf is missing then the module is catastrophically broken.
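The tension between "keep going" and "report the failure" can be sketched as a loop that continues past a failing item but still returns non-zero at the end. This is only an illustration under assumptions: run_for_each is a hypothetical helper, not something dkms provides, and it says nothing about how autoinstall is actually structured.

```shell
# Hypothetical sketch (not dkms code): run a command once per item,
# keep going when one item fails, and return 1 if any item failed.
run_for_each() {
    local cmd failed item
    cmd=$1
    shift
    failed=0
    for item in "$@"; do
        if ! "$cmd" "$item"; then
            echo "Error! $item failed, continuing with the rest." >&2
            failed=1
        fi
    done
    return "$failed"
}
```

With this shape, every item is still processed, but the caller can tell from the exit status that something went wrong.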
@C0rn3j what did you do/what triggered the error on your end - was it manually tinkering around or something OS/packaging that caused it?
I am not sure yet what triggered it. I just had a bunch of broken dkms builds on two machines for non-existent kernel and driver versions. I suspect some weird race condition prodded on by the kernel-modules-hook package.
Looks like we have two things to fix here:
- recovery from an (externally) broken /var/lib/dkms, aka dkms fsck
- error propagation in such a case (the bug reported here)
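As a rough idea of what detecting the broken state could look like, the sketch below scans a dkms tree for module/version directories whose source/dkms.conf is not readable. It is a hypothetical helper, not an existing dkms subcommand. It assumes the /var/lib/dkms/MODULE/VERSION/source layout seen in the error messages above, and it assumes that symlinked entries at the version level (such as per-kernel links) can be skipped.

```shell
# Hypothetical helper (not an existing dkms command): list MODULE/VERSION
# entries under a dkms tree whose source/dkms.conf cannot be read.
find_broken_dkms_entries() {
    local root dir rel
    root=${1:-/var/lib/dkms}
    for dir in "$root"/*/*/; do
        dir=${dir%/}
        [ -d "$dir" ] || continue   # an unmatched glob stays literal
        [ -L "$dir" ] && continue   # assumption: symlinked entries are not versions
        if [ ! -r "$dir/source/dkms.conf" ]; then
            rel=${dir#"$root"/}
            printf '%s\n' "$rel"
        fi
    done
}

# Example: find_broken_dkms_entries            (scans /var/lib/dkms)
#          find_broken_dkms_entries /some/copy  (scans another tree)
```

This only reports candidates; deciding how to repair or remove them is the harder part, since the information in dkms.conf is exactly what is missing.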
One way this broken state could have happened: some packaging removed /usr/src/$driver-$oldversion upon some upgrade without calling the corresponding dkms remove hook first ... This should not happen with Debian packaged *-dkms modules, but I don't know what else is out there in the wild ...
Indeed, splitting this in two makes sense. Recovery would be great, although since the base information (dkms.conf) is missing, I don't know what we can do here.
As for the latter point, we already exit in all the other instances of a missing dkms.conf. So it's a case of either making those non-fatal and then fixing the almost-impossible-to-test error paths, or flipping the final one.
Browsing across the Arch packages:
- kernel-modules-hook - touches only /usr/lib/modules, making and restoring backups
- nvidia-dkms - the one that was likely removed
- dkms itself has a separate hook/script, which does manual parsing/handling (akin to autoinstall), ensuring depmod is called only once per kernel, even if XXXs dkms modules are added/removed.
AFAICT autoinstall does not exist as far as Arch is concerned, although the extra script does call dkms status.
The pacman hook triggering the script is post transaction for install, and pre transaction for update/remove, so it cannot be the one causing the issue.
Considering there is no obvious way this can happen (in Arch and Debian) outside of user error (it's fine, I'm not trying to blame anyone here), I'm inclined to make it a fatal error. If it turns out there's some valid use-case, we can quickly revert it.
That said, let's leave this issue open for a while and see how things go.
# 3.0.12
[0] % sudo dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/550.54.14/source/dkms.conf does not exist.
# 3.0.13
[0] % sudo dkms status
nvidia/550.54.14: broken
Error! nvidia/550.54.14: Missing the module source directory or the symbolic link pointing to it.
Manual intervention is required!
nvidia/550.67, 6.6.23-1-lts, x86_64: installed
nvidia/550.67, 6.7.9-arch1-1, x86_64: installed (original_module exists) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!)
nvidia/550.67, 6.8.2-arch2-1, x86_64: installed
Now with the new release, status goes through everything instead of instantly crashing, which will hopefully make this a bit nicer to debug...
I still haven't found how or why this happens, but it does keep happening.
I have freshly installed .13 so all of this is created with .12: