Using MARS to replicate the running OS itself
We have a basic computer with one disk inside it, and a basic Linux install on it:
/dev/sda
--- /dev/sda1 mounts on / (the Linux OS lives here)
--- /dev/sda2 mounts on /mars
For simplicity, imagine this as your basic home computer, which you turn on and use - browse the web, save files etc. We want to have MARS on this computer, so that at any given point in time we always have a copy of /dev/sda1 at a remote site B. (We don't care at this moment what's happening at B. As long as we have a copy, it's ok.)
How do we achieve this with the current state of MARS (no new development)?
- Do we somehow use a shell during the OS install to create MARS resources?
- Do we install the OS first, then install MARS there, and somehow tell MARS to replicate sda1 to the external site?
How do we achieve this with the current state of MARS (no new development)?
1. Do we somehow use a shell during the OS install to create MARS resources?
Whichever deployment step or method you use to install MARS (currently this is not possible during the initial OS install), you will always need marsadm, which is written in Perl.
If Perl is usable for you, but shell isn't, there might be some smaller tasks, like substituting some Perl-level system($shell_cmd) calls with Perl-internal replacements. Some Perl code already exists, e.g. for triggering systemd, without necessarily involving bash. Doable, but it would need a new MARS release.
If bash itself is no problem for you, then just go on.
If your concern is not the bash itself, but some already existing shell scripts (e.g. in /etc or /boot), or if you cannot patch such scripts for whatever reason: hmm. Then I don't know how to answer your question :(
2. Do we install the OS first, then install MARS there, and somehow tell MARS to replicate sda1 to the external site?
Currently: "yes" for the first part of your question (install the OS first), "yes" for the second part (installation of MARS), and "yes or no" for the third part (replication).
In more detail: the current MARS version is constructed for replication of data partitions.
Likely "yes", because you said that /mars is on a separate partition, which is not being replicated. Otherwise the current answer would be "no".
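For reference, a minimal sketch of how such a resource would be set up with marsadm. The hostnames hostA/hostB and the resource name "rootfs" are assumptions for illustration, and the script only echoes the command sequence instead of executing it (marsadm needs root on real hardware):

```shell
# Dry-run sketch: prints the intended marsadm sequence instead of running it.
run() { echo "+ $*"; }

# On host A (primary), with /mars already mounted from /dev/sda2:
run marsadm create-cluster
run marsadm create-resource rootfs /dev/sda1

# On host B (secondary), with its own /mars partition prepared:
run marsadm join-cluster hostA
run marsadm join-resource rootfs /dev/sda1
```

After join-resource, the initial full sync of /dev/sda1 towards site B would start in the background.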
Some caveat: replication of "/" hasn't been tested by me. Currently I am not sure whether it would work out of the box. Since self-replication of /mars is excluded (as designed by you), it should be doable under the following additional circumstances:
Working title of the following idea: "role-based OS instances" or similar.
AFAICS you may need a different role-based setup for the secondary site B. The secondary-role side should run purely passively under a different IP, e.g. from some PXE-booted RAMdisk, and continuously mirror everything onto its local /dev/sda1 replica. When the primary crashes and the roles need to be switched, and when a reboot of the secondary is acceptable, the reboot should initialize everything in almost the same way as if the original box A had been rebooted.
Afterwards, you must not reboot site A into its ordinary mode, because that would result in two machines having the same IP.
AFAICS, you should then be allowed to reboot site A into the PXE-based RAMdisk, but with a different IP, in order to reverse the direction of the replication.
MARS should support runtime IP address changes (dynamically during operation) via the command "marsadm lowlevel-set-host-ip ..." as documented.
This is just a rough idea, not yet tested.
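The role switch sketched above might look roughly like this. Resource name "rootfs", hostname "hostA" and the new IP are assumptions for illustration, and the commands are only echoed, not executed:

```shell
# Dry-run sketch: prints the intended command sequence instead of running it.
run() { echo "+ $*"; }

# On site B, after the primary at site A crashed (forced role switch):
run marsadm primary --force rootfs

# Later, when the old site A comes back under a different IP
# (e.g. booted into the PXE RAMdisk), record its new address:
run marsadm lowlevel-set-host-ip hostA 10.0.0.99
```

Whether --force is needed depends on whether the old primary is still reachable; see the MARS manual for the exact semantics.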
Addendum: in place of PXE RAMdisks, you might allocate some /dev/sda3 or similar at each site, which is NOT replicated, where you can install a different OS instance just for the secondary role. Of course, you don't need to install your full application stack there, just the parts needed for MARS replication.
Likely, something like this might also work for more than one secondary, e.g. 3 or 4 replicas in total. But you will always remain responsible (even during Ahrtal-like geo-disasters) for ensuring that only the primary announces its globally unique IP, and no one else does.
Via IP forwarding and its siblings, and/or BGP and its siblings, these "requirements invented by me"(tm) can likely be relaxed ;)
There are probably some further alternatives, e.g. when using other incarnations of role-based OS instances.
A question to you: can you tell me your opinion about
a. split brain
b. CAP theorem
Likely this is another issue, requiring a different discussion thread, but it may interfere with this discussion.
Pre-OS shell MARS install - I don't mind which tools: bash, Perl, whatever. We are only talking about technical possibilities without extra development. But ok, now I understand this might not be needed.
Site B secondary - yeah, I know there are things to think about there too, but let's ignore them for the moment and focus on the issue at hand. Easier ;) one step at a time.
After-OS MARS install - ok, so we install MARS once we have the OS ready and working. And yes, we have /mars on a separate device. You talked about "/" replication and /mars exclusion. Well, MARS is not aware of the filesystem, only of the block device sda1 - and you know it best. So I don't see why it would ever be an issue to replicate the OS partition if /mars doesn't live on it.
A. Imagine we have the OS running now, and we set up MARS and ask it to replicate the OS partition. Is this technically possible? Because we can't unmount the OS partition...
B. I guess you are not aware of anyone trying to replicate the OS where MARS is running? I want to understand how likely this is to work with the currently developed version.
Split brain - I understand the concept. I don't need to worry about this at this moment and in my general scenario for the first requirements.
CAP theorem - didn't come across it before. Just had a quick look. Doesn't apply so much in my scenario. But I got the point in general.
Well, MARS is not aware of the filesystem, only of the block device sda1 - and you know it best. So I don't see why it would ever be an issue to replicate the OS partition if /mars doesn't live on it.
Short answer: yes, MARS is constructed for a clean separation between IO on the /mars filesystem and "every other filesystem". Unfortunately, I never tested the special case where "other" means "/", because it was not needed by my current employer.
Because we can't unmount the OS partition...
Now you mention the tricky part of "I have not tested" ;)
I think it either already works to a reasonable degree for you, or it can be implemented at either
(a) kernel level => bugs may be simple to fix; a missing property would likely mean a high effort under my current conditions, or
(b) much easier for me: tell systemd (or your favourite OS controller) to first umount "/" resp. remount "/" read-only and make sure that IO on / has really stopped ***, when necessary spending some time so that the secondaries have a fair chance to fetch the last logfile data over the network (the MARS-specific ports), before shutting down the MARS subsystem and waiting until it also has flushed everything, and finally rebooting via syscall.
*** means: I have seen some cases in production where this is not always as true as it should be. Promised by --- ummm, speculating --- the kernel side, e.g. flushers & co? The sysadmin / operational side? The systemd target configs / dependencies? The hardware / firmware? A misconfig via "reboot -f -n", e.g. caused by some timeout? Whatever?
For safety: somebody should really test this (and thoroughly, in later project phases).
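Variant (b) above could be sketched as a shutdown sequence like the following. The resource name "rootfs" is an assumption, the commands are only echoed (not executed), and whether your systemd units can actually be ordered this way is exactly the untested part:

```shell
# Dry-run sketch of a possible clean-shutdown order for a replicated "/".
run() { echo "+ $*"; }

run mount -o remount,ro /   # stop writes to the replicated device
run sync                    # flush the page cache
run marsadm view rootfs     # check that secondaries have fetched/applied the logfiles
run marsadm down rootfs     # detach and stop the MARS resource
run systemctl reboot        # only then reboot
```

The critical ordering constraint is that IO on / must have really stopped before MARS is shut down, as discussed above.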
I guess you are not aware of anyone trying to replicate the OS where MARS is running? I want to understand how likely this is to work with the currently developed version.
OK, the likelihood is estimated as follows: I only know of potential obstacles; currently I don't know of any real obstacles.
Split brain - I understand the concept. I don't need to worry about this at this moment and in my general scenario for the first requirements.
Ah, thanks. I will not worry about this either, except when you tell me something else.
CAP theorem - didn't come across it before. Just had a quick look. Doesn't apply so much in my scenario. But I got the point in general.
OK, accepted.
I think mounting / over MARS is an interesting and intriguing thing, but difficult to do: you would definitely have to heavily customize the initrd image and the boot process. I am not sure it is really possible.
What I have now in production are KVM/Qemu hypervisors with VM running in MARS protected resources. There is a clear distinction between the machine running MARS (host) and the ones running the applications (VMs). If I accidentally try to start a VM where the resource is not primary, the hypervisor simply does not start it for lack of device with that name. To move a VM around, I work in the hosts: shutdown, change primary, change DNS, start.
I used to operate a big email server on a MARS cluster of two nodes, without virtualization: the software was running from filesystems mounted on (LVs from) the MARS resource. Failover was like having a copy of the directory tree on a different server and starting the software from there: we carefully planned and tested it, but over the years there was always the occasional OS update or accidental change that broke something, so every failover was a "5 minutes to follow the procedure and 1 hour to fix this time's problems" kind of task. I was very happy when I migrated this to a "standard" VM with MARS in the hypervisor, as described above - and that was a simpler setup than what we are discussing here.
So, marksaitis, my humble suggestion is: if you are doing something in production, and you hope not to be in a permanent on-call mood, don't mount / on MARS. Use a VM. You would have to invest in learning the virtualization environment (I use KVM/Qemu with plain simple libvirt, i.e. "virsh") and its networking (bridge and get more IPs, or route and forward ports with iptables), but I think it is worth it.
Hope this helps, Bergonz
IMHO, replication or synchronization of the OS itself is best done through file-based replication rather than MARS or DRBD. OSS file-based replication tools such as csync2 (https://github.com/LINBIT/csync2) and Syncthing (https://syncthing.net) are proven solutions for this scenario.
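As a rough illustration, a csync2 configuration for keeping OS configuration in sync between two boxes might look like the fragment below. The hostnames, key path and directory choices are assumptions, not a recommendation; see the csync2 documentation for the real options:

```
# Sketch of /etc/csync2.cfg for two hosts sharing OS configuration.
group osconfig {
    host hosta hostb;
    key /etc/csync2.key;
    include /etc;
    exclude /etc/csync2*;   # never sync the csync2 config/key itself
}
```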