zram-generator
reset-device sometimes fails
Sometimes the reset-device command seems to exit with code 1 without any further information, leaving the initialised device in place, which then causes issues when trying to restart the systemd-zram-setup@zram0 service:
Mar 09 05:15:56 jobs-staging kernel: zram: Added device: zram0
Mar 09 05:15:56 jobs-staging systemd[1]: Created slice Slice /system/systemd-zram-setup.
Mar 09 05:15:56 jobs-staging systemd[1]: Expecting device /dev/zram0...
Mar 09 05:15:57 jobs-staging kernel: zram0: detected capacity change from 0 to 6311120
Mar 09 05:15:57 jobs-staging systemd[1]: Found device /dev/zram0.
Mar 09 05:15:57 jobs-staging systemd[1]: Starting Create swap on /dev/zram0...
Mar 09 05:15:57 jobs-staging systemd-makefs[627]: /dev/zram0 successfully formatted as swap (label "zram0", uuid 766a397f-d87f-4637-a7ef-df3f87a5167f)
Mar 09 05:15:57 jobs-staging systemd[1]: Finished Create swap on /dev/zram0.
Mar 09 05:15:57 jobs-staging systemd[1]: Activating swap Compressed Swap on /dev/zram0...
Mar 09 05:15:57 jobs-staging kernel: Adding 3155556k swap on /dev/zram0. Priority:5 extents:1 across:3155556k SSDsc
Mar 09 05:15:57 jobs-staging systemd[1]: Activated swap Compressed Swap on /dev/zram0.
Mar 11 10:46:16 jobs-staging systemd[1]: dev-zram0.swap: Deactivated successfully.
Mar 11 10:46:16 jobs-staging systemd[1]: Deactivated swap Compressed Swap on /dev/zram0.
Mar 11 10:46:16 jobs-staging systemd[1]: Stopping Create swap on /dev/zram0...
Mar 11 10:46:16 jobs-staging systemd[1]: systemd-zram-setup@zram0.service: Control process exited, code=exited, status=1/FAILURE
Mar 11 10:46:16 jobs-staging systemd[1]: systemd-zram-setup@zram0.service: Failed with result 'exit-code'.
Mar 11 10:46:16 jobs-staging systemd[1]: Stopped Create swap on /dev/zram0.
Mar 11 10:46:19 jobs-staging systemd[1]: Starting Create swap on /dev/zram0...
Mar 11 10:46:19 jobs-staging kernel: zram: Can't change algorithm for initialized device
Mar 11 10:46:19 jobs-staging systemd[1]: systemd-zram-setup@zram0.service: Main process exited, code=exited, status=1/FAILURE
Mar 11 10:46:19 jobs-staging systemd[1]: systemd-zram-setup@zram0.service: Failed with result 'exit-code'.
Mar 11 10:46:19 jobs-staging systemd[1]: Failed to start Create swap on /dev/zram0.
Mar 11 10:46:19 jobs-staging systemd[1]: Dependency failed for Compressed Swap on /dev/zram0.
Mar 11 10:46:19 jobs-staging systemd[1]: dev-zram0.swap: Job dev-zram0.swap/start failed with result 'dependency'.
Manually resetting the device with echo 1 | tee /sys/block/zram0/reset allows the service to be started again.
I've seen this many times now on different machines, but I haven't been able to reproduce it reliably.
This corresponds to setup::run_device_reset(&dev) returning an Err, which main then returns in turn, which should log
Error: [contents of error]
to the standard error stream, but I don't see that in the log. Maybe it got misattributed somehow? (I don't think this should be the case, since the service uses default I/O.) Do you have anything in the unfiltered journal from around the time of the first stop and failure (Mar 11 10:46:16)?
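For reference, a minimal sketch of the line to look for in the journal. It relies on the Rust standard library behaviour where returning an Err from main prints "Error: " followed by the error's Debug form to stderr and exits with status 1. main_error_line is a hypothetical helper that only mirrors that format; the error type zram-generator actually uses may render its contents differently.

```rust
use std::fmt::Debug;

/// Hypothetical helper: builds the same line the Rust runtime writes to
/// stderr when `main` returns `Err(err)`, i.e. "Error: " plus the Debug form.
pub fn main_error_line<E: Debug>(err: &E) -> String {
    format!("Error: {err:?}")
}
```

So a grep for "Error: " in the unfiltered journal around the failed stop should find the message, if it was emitted at all.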
The code itself is
pub fn run_device_reset(device_name: &str) -> Result<()> {
    let reset = Path::new("/sys/block").join(device_name).join("reset");
    fs::write(reset, b"1")?;
    Ok(())
}
which is basically infallible (literally equivalent to printf 1 > /sys/block/$1/reset), and there are no other error paths.
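If the error contents turn out to be missing or unhelpful, one option would be to attach the path and the OS error code to the error before it propagates. This is only a sketch, not the project's code: reset_with_context and its sys_block parameter are hypothetical names, and it uses std::io::Result rather than whatever Result alias the crate uses.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical variant of the reset step that keeps the failing path and the
/// OS error code in the message. `sys_block` is a parameter only so that the
/// sketch can be exercised without a real zram device.
pub fn reset_with_context(sys_block: &Path, device_name: &str) -> io::Result<()> {
    let reset = sys_block.join(device_name).join("reset");
    fs::write(&reset, b"1").map_err(|err| {
        // Preserve the original error kind, but describe what was attempted.
        io::Error::new(
            err.kind(),
            format!(
                "writing 1 to {} failed: {} (raw OS error: {:?})",
                reset.display(),
                err,
                err.raw_os_error()
            ),
        )
    })
}
```

Called as reset_with_context(Path::new("/sys/block"), "zram0"), a failure would then name the file and the errno instead of surfacing only as an exit status.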
If you don't have anything in the journal, could you perhaps modify systemd-zram-setup@zram0.service to have ExecStop=strace ... instead?
Yeah, I checked the source code and I don't get how this can fail either.
How do I see those unfiltered logs?
I can try the strace later, because I keep on running into this on different servers on a weekly basis.
An unfiltered journalctl --since=... --until=... should be the ground truth, I think.
Ah, in that case there's nothing more...
I'll try with strace.