Make LVM snapshot default when no issues get reported

Open szaimen opened this issue 3 years ago • 51 comments

This is just a reminder so that we don't forget to make the LVM snapshot default when no issues get reported. https://github.com/nextcloud/vm/blob/39e64fe07920bea14f064abaf71847cd5c7165a3/nextcloud_install_production.sh#L68 After we do this, everyone will be able to use the built-in backup solution.
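
A rough sketch of the snapshot lifecycle this relies on (the exact invocations in the scripts may differ, and the 5G size is only the figure discussed further down, so treat this as an illustration rather than the actual code):

lvremove /dev/ubuntu-vg/NcVM-snapshot -y
lvcreate --size 5G --snapshot --name NcVM-snapshot /dev/ubuntu-vg/ubuntu-lv
# ... run the update or backup; if it goes wrong, the snapshot can be merged
# back onto the root volume with 'lvconvert --merge'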

szaimen avatar May 22 '21 21:05 szaimen

I'd say once development on that part has been quiet for some time because it's rock solid, then maybe. :)

enoch85 avatar May 22 '21 21:05 enoch85

I'd say it is already pretty stable but yeah

szaimen avatar May 22 '21 21:05 szaimen

We could do this for Ubuntu 22.04 by making the OS disk 45 GB in size, or by extending the drive so there's only 5 GB left and keeping it at 40 GB in total.

Would that work?
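
For reference, how much room is left in the volume group for the snapshot can be checked with vgs (the volume group name below is the one that shows up later in this thread):

vgs --noheadings --units g -o vg_name,vg_size,vg_free ubuntu-vg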

enoch85 avatar Jul 31 '21 09:07 enoch85

cc @small1

enoch85 avatar Jul 31 '21 09:07 enoch85

> We could do this for Ubuntu 22.04 by making the OS disk 45 GB in size, or by extending the drive so there's only 5 GB left and keeping it at 40 GB in total.
>
> Would that work?

From my side, yes 👍

szaimen avatar Aug 02 '21 21:08 szaimen

Just tested on one of my prod instances. I don't think this is stable enough:

Last login: Thu Oct  7 12:49:11 2021 from blablabla
root@cloud:~# bash /var/scripts/update.sh 
Posting notification to users that are admins, this might take a while...
Posting 'Update script started!' to: enoch85
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
Maintenance mode enabled
  Logical volume ubuntu-vg/NcVM-snapshot is used by another device.
Maintenance mode disabled
Starting docker...
Posting notification to users that are admins, this might take a while...
Posting 'Update failed!' to: enoch85

enoch85 avatar Oct 12 '21 18:10 enoch85

> Logical volume ubuntu-vg/NcVM-snapshot is used by another device.

Honestly, I've never seen this issue. What did you do before this issue appeared? Did you reinstall Ubuntu from scratch and choose to add the partition in the install script?

szaimen avatar Oct 12 '21 19:10 szaimen

> Did you reinstall Ubuntu from scratch and choose to add the partition in the install script?

That's the way I've been running this setup for half a year or longer without any issue...

szaimen avatar Oct 12 '21 19:10 szaimen

Or in other words: what are the steps to reproduce this issue?

szaimen avatar Oct 12 '21 20:10 szaimen

> Did you reinstall Ubuntu from scratch and choose to add the partition in the install script?

Yes, since the company was sold, we moved the whole thing to a new server with a fresh install and an export/import of the DB and everything. So it's installed by the book, "your way".

> Or in other words: what are the steps to reproduce this issue?

I don't know. I just ran an update yesterday and it happened. No automatic updates either.

enoch85 avatar Oct 13 '21 18:10 enoch85

Thanks! So then I will try to investigate how this could happen :)

szaimen avatar Oct 13 '21 18:10 szaimen

Does it happen every time you run the update script?

szaimen avatar Oct 31 '21 15:10 szaimen

Could be a bug with lvm... https://blog.roberthallam.org/2017/12/solved-logical-volume-is-used-by-another-device/

Could you please try the following commands and post their output here (if it happens again)?

lvremove -v /dev/ubuntu-vg/NcVM-snapshot

dmsetup info -c | grep NcVM | grep snapshot

# more to come when we have more info based on the guide linked above
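
If it does happen again, dmsetup ls --tree is also worth a look to see how the device-mapper nodes reference each other (just a suggestion, not part of the original comment):

dmsetup ls --tree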

szaimen avatar Oct 31 '21 15:10 szaimen

@enoch85 do you have some feedback here? It is hard to debug without a way to reproduce this issue...

szaimen avatar Nov 18 '21 22:11 szaimen

As it's not in the released version yet, please add a PR with the fix you proposed, and I'll run one of the auto update VMs with the new setup.

enoch85 avatar Nov 19 '21 08:11 enoch85

I can try. But after reading through the code, did you try to reboot the affected server once after you got the notification that the update failed because of the failed lvremove?

szaimen avatar Nov 19 '21 09:11 szaimen

I've only seen this once, and I'm not sure if the server was rebooted or not.

If you think it can be improved, then do so, else leave it for now.

enoch85 avatar Nov 19 '21 10:11 enoch85

Thanks for the feedback! Honestly, since I still think that this is a bug in LVM itself, I don't think I can improve the logic/code. I could try to work around the symptoms but not solve the issue itself. So a reboot is probably still the best option in this case. Since you only saw this once, I think it's fine, though. Do you agree?

szaimen avatar Nov 19 '21 12:11 szaimen

I'm still not convinced it should be the default for the VM. It's one more thing that could break, and we want to keep those events limited.

enoch85 avatar Nov 19 '21 17:11 enoch85

It happened again.

  1. Run menu.sh --> minor
  2. It finished as expected
  3. Run menu.sh --> update again

enoch85 avatar Jan 29 '22 12:01 enoch85

Some debug output:

Posting 'Update script started!' to: enoch85
++ hostname -f
+ nextcloud_occ_no_check notification:generate -l 'The update script in the Nextcloud VM has been executed.
You will be notified when the update is done.
Please don'\''t shutdown or restart your server until then.' enoch85 'cloud.hanssonit.se: Update script started!'
+ sudo -u www-data php /var/www/nextcloud/occ notification:generate -l 'The update script in the Nextcloud VM has been executed.
You will be notified when the update is done.
Please don'\''t shutdown or restart your server until then.' enoch85 'cloud.hanssonit.se: Update script started!'
+ check_free_space
+ vgs
++ vgs
++ grep ubuntu-vg
++ awk '{print $7}'
++ grep -oP '[0-9]+\.[0-9]'
++ sed 's|\.||'
++ grep g
+ FREE_SPACE=
+ '[' -z '' ']'
+ FREE_SPACE=0
+ '[' -f /var/scripts/nextcloud-startup-script.sh ']'
+ does_snapshot_exist NcVM-startup
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-startup ']'
+ return 1
+ does_snapshot_exist NcVM-snapshot
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-snapshot ']'
+ return 0
+ '[' -f /var/scripts/daily-borg-backup.sh ']'
+ crontab -u root -l
+ grep -v 'lvrename /dev/ubuntu-vg/NcVM-snapshot-pending'
+ crontab -u root -
+ crontab -u root -l
+ cat
+ crontab -u root -
+ echo '@reboot /usr/sbin/lvrename /dev/ubuntu-vg/NcVM-snapshot-pending /dev/ubuntu-vg/NcVM-snapshot &>/dev/null'
+ SNAPSHOT_EXISTS=1
+ is_docker_running
+ docker ps -a
+ check_command systemctl stop docker
+ systemctl stop docker
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
+ nextcloud_occ maintenance:mode --on
+ check_command sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
+ sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
Maintenance mode enabled
+ does_snapshot_exist NcVM-startup
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-startup ']'
+ return 1
+ does_snapshot_exist NcVM-snapshot
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-snapshot ']'
+ return 0
+ lvremove /dev/ubuntu-vg/NcVM-snapshot -y
  Logical volume ubuntu-vg/NcVM-snapshot is used by another device.
+ nextcloud_occ maintenance:mode --off
+ check_command sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
+ sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
Maintenance mode disabled
+ start_if_stopped docker
+ pgrep docker
+ print_text_in_color '\e[0;96m' 'Starting docker...'
+ printf '%b%s%b\n' '\e[0;96m' 'Starting docker...' '\e[0m'
Starting docker...
+ systemctl start docker.service
++ date +%T
+ notify_admin_gui 'Update failed!' 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33'
+ local NC_USERS
+ local user
+ local admin
+ is_app_enabled notifications
+ sed '/Disabled/,$d'
+ awk '{print$2}'
+ nextcloud_occ app:list
+ check_command sudo -u www-data php /var/www/nextcloud/occ app:list
+ sudo -u www-data php /var/www/nextcloud/occ app:list
+ sed '/^$/d'
+ grep -q '^notifications$'
+ tr -d :
+ return 0
+ print_text_in_color '\e[0;96m' 'Posting notification to users that are admins, this might take a while...'
+ printf '%b%s%b\n' '\e[0;96m' 'Posting notification to users that are admins, this might take a while...' '\e[0m'
Posting notification to users that are admins, this might take a while...
+ send_mail 'Update failed!' 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33'
+ local RECIPIENT
+ '[' -f /etc/msmtprc ']'
+ return 1
+ '[' -z enoch85 ']'
+ for admin in "${NC_ADMIN_USER[@]}"
+ print_text_in_color '\e[0;92m' 'Posting '\''Update failed!'\'' to: enoch85'
+ printf '%b%s%b\n' '\e[0;92m' 'Posting '\''Update failed!'\'' to: enoch85' '\e[0m'
Posting 'Update failed!' to: enoch85
++ hostname -f
+ nextcloud_occ_no_check notification:generate -l 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33' enoch85 'cloud.hanssonit.se: Update failed!'
+ sudo -u www-data php /var/www/nextcloud/occ notification:generate -l 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33' enoch85 'cloud.hanssonit.se: Update failed!'
+ msg_box 'It seems like the old snapshot could not get removed.
This should work again after a reboot of your server.'
+ '[' -n '' ']'
+ whiptail --title 'Nextcloud VM - 2022 - Nextcloud Update Script' --msgbox 'It seems like the old snapshot could not get removed.
This should work again after a reboot of your server.' '' ''
+ exit 1

enoch85 avatar Jan 29 '22 12:01 enoch85

Thanks for the verbose output! Please try the following and report back:

lvremove -v /dev/ubuntu-vg/NcVM-snapshot

dmsetup info -c | grep NcVM | grep snapshot

# more to come when we have more info based on the guide linked above

szaimen avatar Jan 29 '22 12:01 szaimen

Already rebooted ;/

enoch85 avatar Jan 29 '22 12:01 enoch85

> Already rebooted ;/

hm :/

szaimen avatar Jan 29 '22 12:01 szaimen

OK, managed to reproduce it and here's the output:

root@cloud:~# lvremove -v /dev/ubuntu-vg/NcVM-snapshot
  Logical volume ubuntu-vg/NcVM-snapshot in use.
root@cloud:~# dmsetup info -c | grep NcVM | grep snapshot
ubuntu--vg-NcVM--snapshot     253   3 L--w    1    1      2 LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7Kpttkwe5oq1zHuHZW7Ia6auXkP4fS59G1HaSX     
ubuntu--vg-NcVM--snapshot-cow 253   2 L--w    1    1      2 LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7Kpttkwe5oq1zHuHZW7Ia6auXkP4fS59G1HaSX-cow 

enoch85 avatar Jan 29 '22 12:01 enoch85

Great! As a follow-up: what's the output of

ls -la /sys/dev/block/253\:3/holders
ls -la /sys/dev/block/253\:2/holders
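
The same check without hardcoding the 253:x numbers could look like this, using the device names from the dmsetup output above (illustrative only):

for name in ubuntu--vg-NcVM--snapshot ubuntu--vg-NcVM--snapshot-cow
do
    node="$(basename "$(readlink -f "/dev/mapper/$name")")"   # e.g. dm-3
    echo "== $name ($node) =="
    ls -la "/sys/block/$node/holders"
done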

szaimen avatar Jan 29 '22 12:01 szaimen

Once I have the output, it should only take one command to remove the blocking device, and afterwards the lvremove should finally work :) That would be a better way to solve this than rebooting, and it's something we could automate in case lvremove fails :)

szaimen avatar Jan 29 '22 12:01 szaimen

root@cloud:~# ls -la /sys/dev/block/253\:3/holders
total 0
drwxr-xr-x 2 root root 0 jan 29 13:33 .
drwxr-xr-x 9 root root 0 jan 29 13:33 ..
root@cloud:~# ls -la /sys/dev/block/253\:2/holders
total 0
drwxr-xr-x 2 root root 0 jan 29 13:33 .
drwxr-xr-x 9 root root 0 jan 29 13:33 ..
lrwxrwxrwx 1 root root 0 jan 29 14:29 dm-3 -> ../../dm-3

enoch85 avatar Jan 29 '22 13:01 enoch85

Thanks! After running the following commands, the removal should work. Please report back!

dmsetup remove /dev/dm-3
lvremove -v /dev/ubuntu-vg/NcVM-snapshot
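
If the dm-N number is different on another machine, dmsetup remove also accepts the mapper name (in the output further up, 253:3 / dm-3 is ubuntu--vg-NcVM--snapshot), so the equivalent would be:

dmsetup remove ubuntu--vg-NcVM--snapshot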

szaimen avatar Jan 29 '22 13:01 szaimen

If that works, I will try to come up with a PR that fixes this once and for all :)
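
For what it's worth, a rough sketch of what such an automated fallback could look like (illustrative only, not the actual PR; it assumes the volume group and device names used throughout this thread):

# If lvremove reports the snapshot as in use, tear down the leftover
# device-mapper nodes and retry once before asking the user to reboot.
if ! lvremove -y /dev/ubuntu-vg/NcVM-snapshot
then
    dmsetup remove ubuntu--vg-NcVM--snapshot 2>/dev/null || true
    dmsetup remove ubuntu--vg-NcVM--snapshot-cow 2>/dev/null || true
    if ! lvremove -y /dev/ubuntu-vg/NcVM-snapshot
    then
        echo "Could not remove NcVM-snapshot - Please reboot your server!"
        exit 1
    fi
fi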

szaimen avatar Jan 29 '22 14:01 szaimen