
`verdi storage backup`

Open eimrek opened this issue 1 year ago • 68 comments

This PR adds functionality to back up an AiiDA profile.

eimrek avatar Jul 05 '23 07:07 eimrek

Codecov Report

Attention: Patch coverage is 86.27451% with 14 lines in your changes missing coverage. Please review.

Project coverage is 78.16%. Comparing base (e330004) to head (98f8456). Report is 8 commits behind head on main.

Files Patch % Lines
src/aiida/storage/psql_dos/backend.py 81.40% 8 Missing :warning:
src/aiida/orm/implementation/storage_backend.py 89.48% 4 Missing :warning:
src/aiida/cmdline/commands/cmd_storage.py 93.34% 1 Missing :warning:
src/aiida/storage/sqlite_dos/backend.py 66.67% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6069      +/-   ##
==========================================
+ Coverage   77.81%   78.16%   +0.35%     
==========================================
  Files         550      558       +8     
  Lines       40720    42009    +1289     
==========================================
+ Hits        31684    32833    +1149     
- Misses       9036     9176     +140     


codecov[bot] avatar Jul 05 '23 08:07 codecov[bot]

Thanks a lot @eimrek. It would indeed be super useful to have an easy way to back up a profile's data storage (and restore it), so I appreciate you looking into this.

Where is the correct location?

If it is necessary to have a standalone bash script, I think it would make most sense to put it in the docs directory, and to add a section (if it doesn't already exist) in the docs that explains how to use it and includes an HTML link allowing the user to download a copy of the script. There are already examples in the docs of how to add such a download link to serve a file.

Does calling the separate check_if_profile_locked.py make sense?

Not sure. I don't think it is actually being used at all in the current state of this PR, is it? Is your idea that this backup should be able to run even when the profile is in use? Or should the user make sure nothing is using the profile during the backup process? I would personally definitely favor the latter option. Just as for verdi storage maintain --full, we would simply enforce that the profile is not in use by adding this check. If the check is needed, ideally we don't keep an ad-hoc separate Python script for it but build it into verdi somehow.

Then a few comments about the current design:

  • We have been working on removing any code in the Python API and the verdi CLI that hardcodes the implementation of the psql_dos storage backend. It is now possible to have profiles with different storage backends, so hard-coding these commands to psql_dos is not what we want. A good example of a command that still does this is verdi profile delete. It should really be refactored so that the deletion logic is implemented on the PsqlDosStorage class and verdi profile delete just calls it. Similarly, I don't think it is a good idea to add verdi profile dbdump and hardcode it to psql_dos. I am wondering if instead we should define the backup abstract method on the StorageBackend class and have PsqlDosBackend implement it. I also don't see why this should cover only the database and not also the repository.
  • Why is the backing up of the disk-objectstore container done in so many individual steps in the backup script? Why doesn't it just do an rsync -azh --delete? Is this because you want to make it possible to have the backup script run on a profile that is actively being used?

My first gut-instinct would be to implement something like the following:

import os
import pathlib
import subprocess
from abc import abstractmethod

# Add abstract method to the base storage class

class StorageBackend:

    @abstractmethod
    def backup(self, **kwargs):
        """Create a backup of the storage contents."""

# which is then implemented in subclasses such as the `psql_dos` backend

class PsqlDosBackend(StorageBackend):

    def backup(self, filepath_backup: pathlib.Path):
        """Create a backup of the postgres database and disk-objectstore to the provided path."""
        # Here you can now use the Python API to get access to connection details of the postgres
        # database and location of the disk-object store. Simplest is then to use `subprocess` to
        # call `pg_dump` and `rsync`.
        cfg = self._profile.storage_config

        env = os.environ.copy()
        env['PGPASSWORD'] = cfg['database_password']

        try:
            subprocess.run(
                ['pg_dump', '-h', cfg['database_hostname'], '-f', str(filepath_backup / 'db.psql'), ...],
                check=True,
                env=env
            )
        except subprocess.CalledProcessError:
            # Handle it (log, clean up, re-raise a storage-specific exception, ...)
            raise

        # helper returning the path of the profile's disk-objectstore container
        filepath_container = get_filepath_container(self._profile)

        try:
            subprocess.run(['rsync', '-azh', str(filepath_container), str(filepath_backup)], check=True)
        except subprocess.CalledProcessError:
            # Handle it (log, clean up, re-raise a storage-specific exception, ...)
            raise

The advantage of this approach would be:

  • Not adding any verdi commands that are hardcoded to work only for the psql_dos storage; they can also work for any other storage implementation
  • No need for separate scripts that need to be maintained and downloaded by the user
  • Can make a generic verdi storage backup command that works for all profiles and storage backends. This command can make use of the built-in profile locking and checks, just like verdi storage maintain does, so there is no need for a stand-alone script

Do you see any fundamental problems with such an approach? Is there anything critical in the bash script you wrote that could not be implemented in Python on the PsqlDosBackend class?

sphuber avatar Jul 06 '23 00:07 sphuber

Thanks a lot @sphuber for the comments. What you say makes sense. Initially the idea was that a bash script would be preferable because it allows users to see the logic and potentially modify it for their use case, but this is probably not needed very often. We discussed this, and implementing it directly in Python, as you suggest, is probably better. I'll try to adapt.

Regarding

Why is the backing up of the disk-objectstore container done in so many individual steps in the backup script? Why doesn't it just do an rsync -azh --delete? Is this because you want to make it possible to have the backup script run on a profile that is actively being used?

The idea is that if the user is adding nodes at the same time as the backup is running, this is the safest order in which to back up the DOS (1. loose files; 2. sqlite db; 3. packed files).
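To illustrate the ordering (a simplified sketch, not the actual script; the file and directory names are only indicative of the disk-objectstore container layout, and the paths are placeholders):

import pathlib
import subprocess

container_path = pathlib.Path('/path/to/profile/container')  # placeholder
backup_path = pathlib.Path('/path/to/backup/destination')    # placeholder

def rsync(src, dest):
    subprocess.run(['rsync', '-azh', str(src), str(dest)], check=True)

# 1. loose objects first: new writes land here
rsync(container_path / 'loose', backup_path)
# 2. the container's SQLite metadata database
rsync(container_path / 'packs.idx', backup_path)
# 3. packed object files last, so the already-copied index never references
#    pack data that is missing from the backup
rsync(container_path / 'packs', backup_path)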

eimrek avatar Jul 06 '23 08:07 eimrek

The idea is that if the user is adding nodes at the same time as the backup is running, this is the safest order in which to back up the DOS (1. loose files; 2. sqlite db; 3. packed files).

I see. I would personally just opt to say that we don't support using the profile during backup. We do the same for the maintenance operation. Now I can see the argument that there, at least for full maintenance, concurrency is technically impossible, whereas for backing up it is possible in the way you describe, but it does add additional complexity and potential sources of problems. Would it not be easier to just tell users that the profile should not be used while backing up the storage? I think this is reasonable, or do you see a fundamental problem with this, where people might need/want to back up so often that stopping use of the profile is too disruptive?

And if it is necessary to be able to do a live backup, maybe this logic can be implemented in the disk-objectstore directly? If it is this specific, maybe it would make sense to have some API there, either in the Python API or through the CLI (I think it has a dostore CLI endpoint). If its data structures or behavior ever change, the backup logic can then be updated in a coherent manner.

sphuber avatar Jul 06 '23 15:07 sphuber

@sphuber thanks for the comments - I agree with the suggestions above. At the beginning the idea was that it would be complex to make this general enough, so it was better to generate a script that people could adapt. But in the end it's probably possible, so we should just have a command in verdi that does everything. This would also allow us to avoid concurrent maintenance operations by simply using the existing functionality in AiiDA. I agree with doing it on the backend.

However, I think it's essential that people are able to back up while they are using AiiDA. This is fast for incremental backups. If we ask people to stop their work in order to back up, they will essentially never do it. It should be something that they can put in a cron job and forget about.

We could do something in disk-objectstore, and I see the points, but this would not allow us to easily reuse the logic to lock the profile etc., and anyway we also want to back up the DB; in AiiDA it will be specific to the backend in any case. I suggest we go ahead, at least to converge on a version we are happy with. We can always move this to disk-objectstore once we see the final implementation, either before merging this PR or even later in the future (https://github.com/aiidateam/disk-objectstore/issues/20)

giovannipizzi avatar Jul 07 '23 09:07 giovannipizzi

However, I think it's essential that people are able to back up while they are using AiiDA. This is fast for incremental backups. If we ask people to stop their work in order to back up, they will essentially never do it. It should be something that they can put in a cron job and forget about.

Fair enough. If you think we can guarantee to do this safely, then I'm fine with doing it that way.

We could do something in disk-objectstore, and I see the points, but this would not allow us to easily reuse the logic to lock the profile etc., and anyway we also want to back up the DB; in AiiDA it will be specific to the backend in any case.

I don't understand how this follows. If you implement the functionality to back up a container in the disk-objectstore itself, either through its Python API and/or its CLI, the AiiDA storage backend can just call this. So from AiiDA's perspective, it will be behind the profile locking functionality. Of course, users who go straight to the disk-objectstore API could circumvent this, but in the disk-objectstore the concept of a profile doesn't exist anyway, so there is nothing to lock.

So if we are writing the logic to create a container backup anyway, I don't see a reason not to put it directly in the disk-objectstore package and have aiida-core call it. Since we control this package, once @eimrek has implemented the logic in this PR and we have verified it works, it is trivial to move it to disk-objectstore, make a release and then use it here.

sphuber avatar Jul 07 '23 16:07 sphuber

I had some time to work on this. Still keeping this a draft, as there is some reorganization left to be done. The two main methods added (both in aiida/storage/psql_dos/backend.py) are:

  1. backup(path, remote, prev_backup) This backs up the storage data to path with rsync, hard-linking unchanged files against prev_backup (a rough sketch of that mechanism follows this list).
  2. backup_auto(path, remote) This is higher level, as discussed with @giovannipizzi and @superstar54: the user doesn't have to manage prev_backup themselves, and the backup is written to path in a safe manner, replacing the old backup upon completion. This is currently exposed via verdi storage backup.
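For context, a minimal sketch (not the actual implementation in this PR) of what "rsync with hard links against prev_backup" amounts to, using rsync's --link-dest option; the function and variable names here are made up for illustration:

import pathlib
import subprocess
from typing import Optional

def rsync_incremental(source: pathlib.Path, dest: pathlib.Path, prev_backup: Optional[pathlib.Path] = None):
    """Copy ``source`` to ``dest``, hard-linking unchanged files against ``prev_backup``."""
    cmd = ['rsync', '-azh', '--delete']
    if prev_backup is not None:
        # files identical to the previous backup are hard-linked rather than copied,
        # so an incremental backup only costs the space of what actually changed
        cmd.append(f'--link-dest={prev_backup}')
    cmd += [f'{source}/', str(dest)]
    subprocess.run(cmd, check=True)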

Method 2) probably shouldn't be in aiida/storage/psql_dos/backend.py, as it would work with any storage backend.

Another option is to instead expose method 1) via verdi storage backup and remove method 2) from aiida-core. This would be more flexible, but it might be less convenient for users.

@sphuber, if you have a moment, any suggestions welcome.

eimrek avatar Aug 03 '23 17:08 eimrek

Reorganized a bit: there are some command-line calls I'm making (including rsync) that don't really need to be in the PsqlDosBackend class, so I separated them into a module: backup_utils. Completely fine to put it somewhere else.

One tiny usability doubt about the current implementation: without specifying prev_backup, the backup is made to a subfolder of the specified --path: <path>/last-backup. But if the user does specify prev_backup, it is created directly in path. I think this is explained adequately in the --help. The reasoning is that in the default case, I automatically manage the "previous" and "live" backups, and this should be done via subfolders; when the user specifies prev_backup, making subfolders doesn't really make sense. Any suggestions/feedback welcome though.

Pinging those who were interested (a quick test would be useful): @unkcpz @superstar54 @edan-bainglass

eimrek avatar Sep 07 '23 18:09 eimrek

@eimrek I did a test in an AiiDAlab docker container. It works. I share the command here:

verdi storage backup --pg_dump_exec /opt/conda/envs/aiida-core-services/bin/pg_dump --path /home/jovyan/test/aiida/backup

When I use the remote folder, it requires me to enter the password more than ten times. Is it possible to reduce this? Or maybe print a message before asking for the password; otherwise the user may think the password is wrong.

superstar54 avatar Sep 21 '23 11:09 superstar54

From an aiidalab container running on my Windows machine, I get the following:

(base) (backup-script) aiida-core > verdi storage backup --pg_dump_exec /opt/conda/envs/aiida-core-services/bin/pg_dump --path C:\\Users\\edanb\\Aurora\\backup --remote [email protected]
Warning: You are currently using a post release development version of AiiDA: 2.4.0.post0
Warning: Be aware that this is not recommended for production and is not officially supported.
Warning: Databases used with this version may not be compatible with future releases of AiiDA
Warning: as you might not be able to automatically migrate your data.

[email protected]'s password: 
[email protected]'s password: 
Report: Remote '[email protected]' is accessible!
[email protected]'s password: 
[email protected]'s password: 
[email protected]'s password: 
Error: Command '['ssh', '[email protected]', 'mkdir', 'C:\\Users\\edanb\\Aurora\\backup/live_backup']' returned non-zero exit status 1.
Error: Couldn't access/create 'C:\Users\edanb\Aurora\backup/live_backup'!

Not sure at the moment what might be the issue here

edan-bainglass avatar Oct 09 '23 15:10 edan-bainglass

Okay, one immediate issue is that the path needs to be quote-escaped, so \"C:\.......\". This worked, but then I got

Error: Command '['rsync', '-azh', '-vv', '--no-whole-file', '/tmp/tmprfyv4y7t/db.psql', '[email protected]:"C:/Users/edanb/Aurora/backup"/live_backup/']' returned non-zero exit status 12.

Will keep at it later :)

edan-bainglass avatar Oct 09 '23 15:10 edan-bainglass

thanks a lot for the checks @edan-bainglass.

Do I understand correctly that you want to back up AiiDA from a Linux docker container to a Windows host? I am not sure if this will work, but it would be very useful to try.

If a Python subprocess command fails, you can try to run it manually and see if a modification would make it work; e.g. the last one would be

rsync -azh -vv --no-whole-file /tmp/tmprfyv4y7t/db.psql [email protected]:"C:/Users/edanb/Aurora/backup"/live_backup/

eimrek avatar Oct 09 '23 16:10 eimrek

Yep. That's exactly what I'm doing. Will have to resume tomorrow, but I think it'll work once enough quotes have been escaped 😅 Will report back soon 👍

edan-bainglass avatar Oct 09 '23 16:10 edan-bainglass

This

rsync -azh -vv --no-whole-file /tmp/tmprfyv4y7t/db.psql [email protected]:"C:/Users/edanb/Aurora/backup"/live_backup/

yields

'rsync' is not recognized as an internal or external command,
operable program or batch file.
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(231) [sender=3.2.7]

Some articles online suggest rsync is required on the Windows end as well. One person got it to work by installing rsync via cygwin on the Windows side. If this works, I would need to discuss installing this on the Empa workstation with Empa IT.

Another solution is to mount a shared folder from Windows onto the Linux server (container?) and rsync to the shared folder. If this works, I would likely need to write a script to rsync to the shared folder, ssh to the Empa workstation, move the backup from the shared folder over to the dedicated project drive, and clear out the shared folder. Not sure if verdi storage backup would have any issues here.

Will test locally to see if I can get it to work. Not sure which, if any, would be applicable at Empa though 🤔

edan-bainglass avatar Oct 10 '23 04:10 edan-bainglass

Some articles online suggest rsync is required on the Windows end as well. One person got it to work by installing rsync via cygwin on the Windows side. If this works, I would need to discuss installing this on the Empa workstation with Empa IT.

Installed cygwin on my machine, then rsync on it, then exposed cygwin64/bin to PATH. rsync is now usable on Windows.

verdi storage backup ... still gave issues, so I ran the rsync part manually. One immediate issue is the --path to the backup folder. Cygwin starts from the user's directory, so asking for C:/Users/edanb/... yields cygdrive/c/Users/edanb/C:/Users/edanb..., which is obviously wrong. For now, removing C:/Users/edanb from the --path argument circumvents this issue. However, this runs into the present issue:

rsync -azh -vv --no-whole-file /tmp/tmpdn4ti05n/db.psql [email protected]:"Aurora/backup"/live_backup/
opening connection using: ssh -l edanb host.docker.internal rsync --server -vvlogDtprze.iLsfxCIvu . Aurora/backup/live_backup/  (9 args)
[email protected]'s password: 
sending incremental file list
rsync: [sender] change_dir "/tmp/tmpdn4ti05n" failed: No such file or directory (2)
delta-transmission enabled
total: matches=0  hash_hits=0  false_alarms=0 data=0

sent 19 bytes  received 43 bytes  13.78 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7]

Any thoughts on change_dir "/tmp/tmpdn4ti05n" failed: No such file or directory (2)?

edan-bainglass avatar Oct 10 '23 06:10 edan-bainglass

Thanks @edan-bainglass. Sorry, my previous comment was a bit misleading. I'm dumping the PostgreSQL db to a temporary folder in Python and then rsyncing that; I didn't realize that this won't be available when you run the command manually. You can perhaps run tests with other files. In the end, the Python code just uses subprocess to 1) call commands on the remote server, for checking input validity, making folders and such; and 2) call rsync. If both of these work manually, we can adapt the Python subprocess commands accordingly.
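Roughly, that part of the code does something like the following (a simplified sketch, not the exact code; the pg_dump arguments and the remote destination are placeholders):

import pathlib
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    dump_file = pathlib.Path(tmpdir) / 'db.psql'
    # dump the PostgreSQL database into a temporary directory ...
    subprocess.run(['pg_dump', '-f', str(dump_file)], check=True)
    # ... and rsync the dump to the (remote) backup destination; the temporary
    # directory is deleted when the `with` block exits, which is why re-running
    # the printed rsync command by hand afterwards cannot find the file
    subprocess.run(['rsync', '-azh', str(dump_file), 'user@host:/path/to/backup/live_backup/'], check=True)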

eimrek avatar Oct 10 '23 07:10 eimrek

@eimrek "...the python code just uses subprocess to..." - regarding 1, you say it uses bash to execute commands on the host server. Well, the host is Windows, so no bash. However, mkdir exists, so at least the folder is created, with the caveat that the path must be in quotes. As for 2, see my previous comment regarding getting rsync to work for windows. rsync did work manually on a random file. Moved it from my container to the windows drive. But I still can't get it to work as part of verdi storage backup. I'll try again later today. I'll also check the mount approach, which would circumvent the cross-platform rsyncing issue.

edan-bainglass avatar Oct 11 '23 04:10 edan-bainglass

Another solution is to mount a shared folder from Windows onto the Linux server (container?) and rsync to the shared folder. If this works, I would likely need to write a script to rsync to the shared folder, ssh to the Empa workstation, move the backup from the shared folder over to the dedicated project drive, and clear out the shared folder. Not sure if verdi storage backup would have any issues here.

Tried this ☝️

Steps:

  1. Start a container from Windows Docker Desktop with a mounted Windows drive (non-WSL) pointing to the container's /home/jovyan/backup
  2. Install dependencies
    • Clone aiida-core -> checkout backup-script -> install
    • Clone disk-objectstore -> checkout backup-cmd -> install
  3. Install rsync on container
    • Had to connect to the container as root to do this (only for this step - remaining steps executed as jovyan)
    • See issue #411 on aiidalab-docker-stack
  4. Run verdi storage backup --pg_dump_exe /opt/conda/envs/aiida-core-services/bin/pg_dump /home/jovyan/backup

Results:

  • Backup produced no issues 🎉
  • Backup accessible via Windows 🎉🎉
  • Follow-up backup uses previous backup correctly to facilitate incremental backups 🎉🎉🎉

@eimrek not sure what the issue was in your case that raised symlink issues 🤷🏻‍♂️

edan-bainglass avatar Nov 03 '23 08:11 edan-bainglass

Thanks @sphuber for the review. I accidentally did the rebase before implementing your changes, so I couldn't directly commit via GitHub, but I implemented everything manually. The functionality seems to work. Currently there are no tests for this in aiida-core, but at least some basic ones should be added, right?

eimrek avatar Jan 12 '24 13:01 eimrek

Thanks @sphuber for the review. I accidentally did the rebase before implementing your changes, so I couldn't directly commit via GitHub, but I implemented everything manually. The functionality seems to work. Currently there are no tests for this in aiida-core, but at least some basic ones should be added, right?

Thanks @eimrek. I agree, we should definitely have some tests. I would add a test for the PsqlDosStorage.backup command directly, testing that the backup actually works. The CLI tests can then be very light, just checking that the parameters work (that reminds me, it might be best to add as much validation as possible in the CLI itself, for example that keep is a positive integer greater than 1). We don't have to check the "correctness" of the actual backup again, as that is already done elsewhere.
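For the CLI-side validation, something along these lines could work (a plain-click sketch; in verdi the option would go through the existing option helpers, and the exact lower bound is whatever we settle on):

import click

def validate_keep(ctx, param, value):
    """Reject values of ``--keep`` below the agreed lower bound."""
    if value < 1:  # adjust the bound to whatever we decide on
        raise click.BadParameter('must be a positive integer')
    return value

@click.command()
@click.option('--keep', type=int, default=1, callback=validate_keep,
              help='Number of previous backups to keep in the destination.')
def backup(keep):
    """Hypothetical stand-in for `verdi storage backup`, used only for this sketch."""
    click.echo(f'keeping {keep} previous backup(s)')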

sphuber avatar Jan 12 '24 14:01 sphuber

One thing that might be good is to check whether the profile name of a new backup matches that of any existing backups in the destination.

eimrek avatar Jan 12 '24 15:01 eimrek

@sphuber I added a very basic pytest that seems to work, but I'm not fully sure how the storage backend is determined in this case. It should probably be modified somehow so that psql_dos is selected.

Note: for the other tests to pass, we should make a release of disk-objectstore and update the dependency.

eimrek avatar Jan 18 '24 17:01 eimrek

@sphuber I added a very basic pytest that seems to work, but I'm not fully sure how the storage backend is determined in this case. It should probably be modified somehow so that psql_dos is selected.

Thanks. It just takes the backend of the currently loaded profile for the test suite. If you want to force a particular storage backend, you would have to create a profile with that storage and load it. But I would argue that is not the purpose of this test. This test does not verify that the backup functionality of any particular storage plugin works; that should be done in tests specific to that storage backend. So I think this test is just fine.

That being said, it would be good to add a test somewhere in tests/storage/psql_dos that explicitly tests the backup functionality of the PsqlDosStorage class. This can just go through the Python API of course.

Note: for the other tests to pass, we should make a release of disk-objectstore and update the dependency.

I was hoping to hold off on this while we test its integration in aiida-core. For the time being, you can install from the main branch of the repo: simply replace the disk-objectstore line in pyproject.toml with disk-objectstore@git+https://github.com/aiidateam/disk-objectstore, and do the same in the requirements/*.txt files.

sphuber avatar Jan 19 '24 11:01 sphuber

I adapted the backup docs so that they mention the new command but also keep the "manual instructions".

Additionally, I tested restoring from a backup of a profile with 35K nodes (around 170 MB total size) to a different machine and checked that

  • verdi storage info matches;
  • Some basic queries match (e.g. all StructureData nodes).

@edan-bainglass feel free to test on your system & check the updated docs to see if they're easy to follow.

One small doubt I have is whether the backup instructions should be under the topic "How to manage your installation", as they might not be so easy to find there.

eimrek avatar Jan 24 '24 11:01 eimrek

I automatically created a pull request (#) that adapts the requirements/ files according to the dependencies specified in the 'pyproject.toml' file.

github-actions[bot] avatar Jan 24 '24 12:01 github-actions[bot]

I automatically created a pull request (#) that adapts the requirements/ files according to the dependencies specified in the 'pyproject.toml' file.

github-actions[bot] avatar Jan 24 '24 14:01 github-actions[bot]

aiida/storage/psql_dos/backend.py:docstring of aiida.storage.psql_dos.backend.PsqlDosBackend._backup:1: WARNING: py:class reference target not found: disk_objectstore.backup_utils.BackupManager

This is why the docs build is failing.

unkcpz avatar Jan 25 '24 13:01 unkcpz

aiida/storage/psql_dos/backend.py:docstring of aiida.storage.psql_dos.backend.PsqlDosBackend._backup:1: WARNING: py:class reference target not found: disk_objectstore.backup_utils.BackupManager

This is why the docs build is failing.

Thanks Jason. This is because we are temporarily installing disk-objectstore from the repo, but this is not done for the docs, so it does not have the latest version. When we release it and install from PyPI, then the docs should also have this reference.

sphuber avatar Jan 29 '24 10:01 sphuber

aiida/storage/psql_dos/backend.py:docstring of aiida.storage.psql_dos.backend.PsqlDosBackend._backup:1: WARNING: py:class reference target not found: disk_objectstore.backup_utils.BackupManager This is why the docs build is failing.

Thanks Jason. This is because we are temporarily installing disk-objectstore from the repo, but this is not done for the docs, so it does not have the latest version. When we release it and install from PyPI, then the docs should also have this reference.

Thanks both, I added it to the docs nitpick-exceptions and the docs are building nicely.

eimrek avatar Jan 29 '24 11:01 eimrek

From further discussion with @sphuber, it still seems that it would be better to remove the optional arguments --rsync-exe and --pg-dump-exe. The latter because it's psql_dos-specific and doesn't really fit the plugin/modularity philosophy; and if we require executables in PATH, we might as well assume that rsync is in PATH as well.

Regarding how the user knows which executables are needed in PATH, I think we can just let the storage backend check and notify the user if something is missing and how to set it up.

Initially, the --pg-dump-exe argument was mentioned by (I think) @superstar54 and @giovannipizzi because you had a specific need for it in some AiiDAlab environment, right? Is the need still there, or can this be done through PATH?

eimrek avatar Jan 29 '24 12:01 eimrek