aiida-core
`verdi storage backup`
This PR adds functionality to back up an AiiDA profile.
Codecov Report

Attention: Patch coverage is 86.27451% with 14 lines in your changes missing coverage. Please review.

Project coverage is 78.16%. Comparing base (e330004) to head (98f8456). Report is 8 commits behind head on main.
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #6069      +/-   ##
==========================================
+ Coverage   77.81%   78.16%    +0.35%
==========================================
  Files         550      558        +8
  Lines       40720    42009     +1289
==========================================
+ Hits        31684    32833     +1149
- Misses       9036     9176      +140
```
Thanks a lot @eimrek. It would indeed be super useful to have an easy way to back up a profile's data storage (and restore it), so I appreciate you looking into this.
> Where is the correct location?

If it is necessary to have a standalone bash script, I think it would make most sense to put it in the docs directory, and add a section (if it doesn't already exist) in the docs where it explains how to use it, and there include an HTML link that allows the user to download a copy of the script. There are already examples in the docs of how you can add such a download link to serve a file.
> Does calling the separate `check_if_profile_locked.py` make sense?

Not sure. I don't think it is actually being used at all in the current state of this PR, is it? Is your idea that this backup should be able to be run even when the profile is in use? Or should the user make sure nothing is using the profile during the backup process? I would personally definitely favor the latter option. Just as with `verdi storage maintain --full`, we would just enforce that the profile is not in use by adding this check. If the check is needed, ideally we don't have an ad-hoc separate Python script for this but build it into `verdi` somehow.
Then a few comments about the current design:

- We have been working on removing any code in the Python API and the `verdi` CLI that hardcodes the implementation of the `psql_dos` storage backend. It is now possible to have profiles with different storage backends, so hard-coding these commands to `psql_dos` is not what we want. A good example of a command that still does this is `verdi profile delete`. This should really be refactored so that the logic of the deletion is implemented on the `PsqlDosStorage` class and `verdi profile delete` just calls it. Similarly, therefore, I don't think it is a good idea to add `verdi profile dbdump` and hardcode it to `psql_dos`. I am wondering if instead we should define a `backup` abstract method on the `StorageBackend` class and have `PsqlDosBackend` implement it. I also don't see why this should do only the database and not also the repository.
- Why is the backing up of the disk-objectstore container done in so many individual steps in the backup script? Why doesn't it just do an `rsync -azh --delete`? Is this because you want to make it possible to have the backup script run on a profile that is actively being used?
My first gut instinct would be to implement something like the following:

```python
import os
import pathlib
import subprocess
from abc import abstractmethod


# Add abstract method to the base storage class
class StorageBackend:

    @abstractmethod
    def backup(self, **kwargs):
        """Create a backup of the storage contents."""


# which is then implemented in subclasses such as the `psql_dos` backend
class PsqlDosBackend(StorageBackend):

    def backup(self, filepath_backup: pathlib.Path):
        """Create a backup of the postgres database and disk-objectstore to the provided path."""
        # Here you can now use the Python API to get access to connection details of the postgres
        # database and location of the disk-objectstore. Simplest is then to use `subprocess` to
        # call `pg_dump` and `rsync`.
        cfg = self._profile.storage_config
        env = os.environ.copy()
        env['PGPASSWORD'] = cfg['database_password']

        try:
            subprocess.run(
                ['pg_dump', '-h', cfg['database_hostname'], '-f', str(filepath_backup / 'db.psql'), ...],
                check=True,
                env=env,
            )
        except subprocess.CalledProcessError:
            ...  # Handle it

        filepath_container = get_filepath_container(self._profile)

        try:
            subprocess.run(['rsync', '-azh', str(filepath_container), str(filepath_backup)], check=True)
        except subprocess.CalledProcessError:
            ...  # Handle it
```
The advantage of this approach would be:

- Not adding any `verdi` commands that are hardcoded to work only for the `psql_dos` storage; they can also work for any other storage implementation
- No need for separate scripts that need to be maintained and downloaded by the user
- Can make a generic `verdi storage backup` command that works for all profiles and storage backends. This command can make use of the built-in profile locking and checks, just like `verdi storage maintain` does, so no need for a stand-alone script for this
Do you see any fundamental problems with such an approach? Is there anything critical in the bash script you wrote that could not be implemented in Python on the `PsqlDosBackend` class?
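To make the locking point concrete, here is a minimal, self-contained sketch. This is not aiida-core's actual API: `backup_profile`, `DummyBackend`, and the lock-file check are hypothetical stand-ins for the built-in profile locking that `verdi storage maintain` relies on.

```python
import pathlib
import tempfile
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Minimal stand-in for the storage base class."""

    @abstractmethod
    def backup(self, filepath_backup: pathlib.Path):
        """Create a backup of the storage contents."""


class DummyBackend(StorageBackend):
    """Toy backend that just writes a marker file."""

    def backup(self, filepath_backup: pathlib.Path):
        (filepath_backup / 'data.txt').write_text('backed up')


def backup_profile(backend: StorageBackend, filepath_backup: pathlib.Path, lockfile: pathlib.Path):
    """Generic entry point: refuse to run while the profile is in use (signalled
    here by a simple lock file), then dispatch to whatever backend the profile has."""
    if lockfile.exists():
        raise RuntimeError('Profile is in use; stop all processes using it before backing up.')
    filepath_backup.mkdir(parents=True, exist_ok=True)
    backend.backup(filepath_backup)


# Usage: the command only needs the abstract interface, never psql_dos specifics
with tempfile.TemporaryDirectory() as tmp:
    dest = pathlib.Path(tmp) / 'backup'
    backup_profile(DummyBackend(), dest, pathlib.Path(tmp) / 'profile.lock')
    print((dest / 'data.txt').read_text())  # prints "backed up"
```

The point of the sketch is only the shape: the generic command holds the lock check, and the backend-specific work lives behind the abstract method.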
Thanks a lot @sphuber for the comments. What you say makes sense. Initially the idea was that a bash script would be preferable because it allows users to see the logic and potentially modify it for their use case, but probably this is not needed very often. We discussed this, and implementing it directly in Python, as you suggest, is probably better. I'll try to adapt.
Regarding

> Why is the backing up of the disk-objectstore container done in so many individual steps in the backup script? Why doesn't it just do an `rsync -azh --delete`? Is this because you want to make it possible to have the backup script run on a profile that is actively being used?
The idea is that if the user is adding nodes at the same time as the backup is running, this is the safest order in which to back up the DOS (1. loose files; 2. sqlite db; 3. packed files).
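That ordering can be written down as a small helper that emits the rsync calls in the safe sequence. This is only a sketch, not the PR's code, and the container entries (`loose/`, `packs.idx` for the SQLite index, `packs/`) are assumptions about the disk-objectstore layout:

```python
import pathlib


def dos_backup_commands(container: pathlib.Path, dest: pathlib.Path) -> list:
    """Rsync invocations in the order described above, which is safest while
    nodes are still being added: 1. loose objects, 2. the SQLite pack index,
    3. the pack files. The aim is to minimize the chance that the copied index
    references pack data that was not captured."""
    return [
        ['rsync', '-azh', f'{container / "loose"}/', f'{dest / "loose"}/'],
        ['rsync', '-azh', str(container / 'packs.idx'), str(dest / 'packs.idx')],
        ['rsync', '-azh', f'{container / "packs"}/', f'{dest / "packs"}/'],
    ]
```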
> The idea is that if the user is adding nodes at the same time as the backup is running, this is the safest order in which to back up the DOS (1. loose files; 2. sqlite db; 3. packed files).
I see. I would personally just opt to say that we don't support using the profile during backup. We do the same for the maintenance operation. I can see the argument that the cases differ: for the full maintenance, concurrency is technically impossible, whereas for backing up it is possible in the way you describe; but it does add additional complexity and potential sources of problems. Would it not be easier to just tell users that the profile should not be used while backing up the storage? I think this is reasonable, or do you see a fundamental problem with this, where people might need/want to back up so often that stopping use of the profile is too disruptive?
And if it is necessary to be able to do a live backup, maybe this logic can be implemented in the disk-objectstore directly? If it is this specific, maybe it would make sense to have some API there, either in the Python API or through the CLI (I think it has a `dostore` CLI endpoint)? If its data structures or behavior ever change, the backup script can then be updated in a coherent manner.
@sphuber thanks for the comments - I agree with the suggestions above. At the beginning the idea was that it would be too complex to make this general enough, and that it would be better to generate a script that people could change. But in the end it's probably possible, so we should just have a command in `verdi` that does everything. This would indeed also allow us to avoid concurrent maintenance operations by just using the existing functionality in AiiDA. I agree with doing it on the backend.
However, I think it's essential that people will be able to backup while they are using AiiDA. And this works fast for incremental backups. If we ask people to stop their work to backup, they will essentially never do it. It should be something that they can put in a cron job and forget about it.
We could do something in disk-objectstore, and I see the points, but this would not allow us to easily reuse the logic to lock the profile etc., and anyway we want to back up the DB as well; in AiiDA it will anyway be specific to the backend. I suggest we go ahead, at least to converge on a version we are happy with. We can always move this to disk-objectstore once we see the final implementation, either before merging this PR, or even later in the future (https://github.com/aiidateam/disk-objectstore/issues/20).
> However, I think it's essential that people will be able to backup while they are using AiiDA. And this works fast for incremental backups. If we ask people to stop their work to backup, they will essentially never do it. It should be something that they can put in a cron job and forget about it.
Fair enough. If you think we can guarantee to do this safely, then I'm fine with doing it that way.
> We could do something in disk-objectstore, and I see the points, but this would not allow us to easily reuse the logic to lock the profile etc., and anyway we want to back up the DB as well; in AiiDA it will anyway be specific to the backend.
I don't understand how this follows. If you implement the functionality in the disk-objectstore to backup a container, either through Python API and/or its CLI, the AiiDA storage backend can just call this. So from AiiDA's perspective, it will be behind the profile locking functionality. Of course when users would go straight to the disk-objectstore API they could circumvent this, but in the disk-objectstore the concept of a profile doesn't exist anyway so there is nothing to lock.
So if we are anyway writing the logic to create a container backup, then I don't see a reason not to directly put it in the disk-objectstore package and have aiida-core call it. Since we control this package, once @eimrek has implemented the logic in this PR and we have verified it works, it is trivial to move it to disk-objectstore, make a release and then use it here.
I had some time to work on this. Still keeping this a draft, as there is some reorganization left to be done. The two main methods added (both in `aiida/storage/psql_dos/backend.py`) are:

1. `backup(path, remote, prev_backup)` - this backs up the storage data to `path` with rsync and hard links against `prev_backup`.
2. `backup_auto(path, remote)` - this is higher level, what we discussed with @giovannipizzi and @superstar54, where the user doesn't have to manage `prev_backup` themselves, and the backup is done in a safe manner to `path`, replacing the old backup upon completion. This is currently exposed via `verdi storage backup`.
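For context, the hard-link trick behind a `backup(path, remote, prev_backup)` is rsync's `--link-dest`: files unchanged since the previous backup become hard links into it, so an incremental backup only costs the changed files. A sketch of how such a command could be assembled (`build_rsync_command` is a hypothetical helper, not the PR's actual code):

```python
import pathlib
from typing import List, Optional


def build_rsync_command(src: pathlib.Path, dest: str,
                        prev_backup: Optional[pathlib.Path] = None,
                        remote: Optional[str] = None) -> List[str]:
    """Assemble an rsync call that hard-links unchanged files against `prev_backup`."""
    cmd = ['rsync', '-azh', '--delete']
    if prev_backup is not None:
        # Files identical to those under prev_backup are hard-linked, not copied.
        cmd.append(f'--link-dest={prev_backup}')
    # Trailing slash: copy the contents of src, not the directory itself.
    cmd.append(f'{src}/')
    # A remote destination becomes user@host:path, as rsync expects.
    cmd.append(f'{remote}:{dest}' if remote else dest)
    return cmd


# Example: incremental backup against the previous one, over SSH
build_rsync_command(pathlib.Path('/aiida/container'), '/backups/new',
                    prev_backup=pathlib.Path('/backups/prev'),
                    remote='user@server')
```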
Method 2) probably shouldn't be in `aiida/storage/psql_dos/backend.py`, as it would work with any storage backend.

Another option is to instead expose method 1) via `verdi storage backup` and remove method 2) from aiida-core. This would be more flexible, but it might be less convenient for users.
@sphuber, if you have a moment, any suggestions welcome.
Reorganized a bit: there are some command line commands that I'm calling (including `rsync`) that don't really need to be in the `PsqlDosBackend` class, so I separated them into a module: `backup_utils`. Completely fine to put it somewhere else.
One tiny usability doubt about the current implementation: without specifying `prev_backup`, the backup will be made to a subfolder of the specified `--path`, namely `<path>/last-backup`. But if the user does specify `prev_backup`, it is created directly in `path`. I think this is explained adequately in the `--help`. The reasoning is that in the default case I automatically manage the "previous" and "live" backups, and this should be done via subfolders; when the user specifies `prev_backup`, making subfolders doesn't really make sense. Any suggestions/feedback welcome though.
Pinging those who were interested (a quick test would be useful): @unkcpz @superstar54 @edan-bainglass
@eimrek I did a test in an AiiDAlab docker container. It works. I share the command here:

```
verdi storage backup --pg_dump_exec /opt/conda/envs/aiida-core-services/bin/pg_dump --path /home/jovyan/test/aiida/backup
```
When I use the remote folder, it requires me to input the password more than ten times. Is it possible to reduce this? Or maybe print a message before asking for the password; otherwise the user may think the password is wrong.
From an AiiDAlab container running on my Windows machine, I get the following:

```
(base) (backup-script) aiida-core > verdi storage backup --pg_dump_exec /opt/conda/envs/aiida-core-services/bin/pg_dump --path C:\\Users\\edanb\\Aurora\\backup --remote [email protected]
Warning: You are currently using a post release development version of AiiDA: 2.4.0.post0
Warning: Be aware that this is not recommended for production and is not officially supported.
Warning: Databases used with this version may not be compatible with future releases of AiiDA
Warning: as you might not be able to automatically migrate your data.
[email protected]'s password:
[email protected]'s password:
Report: Remote '[email protected]' is accessible!
[email protected]'s password:
[email protected]'s password:
[email protected]'s password:
Error: Command '['ssh', '[email protected]', 'mkdir', 'C:\\Users\\edanb\\Aurora\\backup/live_backup']' returned non-zero exit status 1.
Error: Couldn't access/create 'C:\Users\edanb\Aurora\backup/live_backup'!
```
Not sure at the moment what might be the issue here.
Okay, one immediate issue is that the path needs to be quote-escaped, i.e. `\"C:\.......\"`. This worked, but then I got

```
Error: Command '['rsync', '-azh', '-vv', '--no-whole-file', '/tmp/tmprfyv4y7t/db.psql', '[email protected]:"C:/Users/edanb/Aurora/backup"/live_backup/']' returned non-zero exit status 12.
```

Will keep at it later :)
Thanks a lot for the checks @edan-bainglass.

Do I understand correctly that you want to back up AiiDA from a Linux docker container to a Windows host? I am not sure if this will work, but it would be very useful to try.

If a Python subprocess command fails, you can try to run it manually and see if a modification would make it work; e.g. the last one would be

```
rsync -azh -vv --no-whole-file /tmp/tmprfyv4y7t/db.psql [email protected]:"C:/Users/edanb/Aurora/backup"/live_backup/
```
Yep. That's exactly what I'm doing. Will have to resume tomorrow, but I think it'll work once enough quotes have been escaped 😅 Will report back soon 👍
This

```
rsync -azh -vv --no-whole-file /tmp/tmprfyv4y7t/db.psql [email protected]:"C:/Users/edanb/Aurora/backup"/live_backup/
```

yields

```
'rsync' is not recognized as an internal or external command,
operable program or batch file.
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(231) [sender=3.2.7]
```
Some articles online suggest rsync is also required on the Windows end. One person got it to work by installing rsync via Cygwin on the Windows side. If this works, I would need to discuss installing it on the Empa workstation with Empa IT.
Another solution is to mount a shared folder from Windows onto the Linux server (container?) and `rsync` to the shared folder. If this works, I would likely need to write a script to `rsync` to the shared folder, `ssh` to the Empa workstation, move the backup from the shared folder over to the dedicated project drive, and clear out the shared folder. Not sure if `verdi storage backup` would have any issues here.

Will test locally to see if I can get it to work. Not sure which, if any, would be applicable at Empa though 🤔
> Some articles online suggest rsync is required also on the Windows end. One person got it to work by installing rsync via cygwin on the Windows side. If this works, I would need to discuss with Empa IT installing this on the Empa workstation.
Installed `cygwin` on my machine, then `rsync` on it, then exposed `cygwin64/bin` to `PATH`. `rsync` is now usable on Windows.

`verdi storage backup ...` still gave issues, so I ran the `rsync` part manually. One immediate issue is the `--path` to the backup folder. `cygwin` starts from the user's dir, so asking for `C:/Users/edanb/...` yields `cygdrive/c/Users/edanb/C:/Users/edanb...`, which is obviously wrong. For now, removing `C:/Users/edanb` from the `--path` argument circumvents this issue. However, this runs into the present issue:
```
rsync -azh -vv --no-whole-file /tmp/tmpdn4ti05n/db.psql [email protected]:"Aurora/backup"/live_backup/
opening connection using: ssh -l edanb host.docker.internal rsync --server -vvlogDtprze.iLsfxCIvu . Aurora/backup/live_backup/ (9 args)
[email protected]'s password:
sending incremental file list
rsync: [sender] change_dir "/tmp/tmpdn4ti05n" failed: No such file or directory (2)
delta-transmission enabled
total: matches=0 hash_hits=0 false_alarms=0 data=0
sent 19 bytes received 43 bytes 13.78 bytes/sec
total size is 0 speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7]
```

Any thoughts on `change_dir "/tmp/tmpdn4ti05n" failed: No such file or directory (2)`?
Thanks @edan-bainglass. Sorry, my previous comment was a bit misleading. I'm dumping the PostgreSQL DB to a temporary folder in Python and then rsyncing that; I didn't realize that this won't be available when you run it manually. You can perhaps run tests with other files. In the end, the Python code just uses subprocess to 1) call bash commands on the remote server, for checking input validity, making folders and such; and 2) call rsync. If both of these work manually, we can adapt the Python subprocess commands accordingly.
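In other words, the flow is roughly the following. This is a sketch, not the PR's actual code; the `run` parameter is injectable only so the shape can be exercised without `pg_dump`/`rsync` installed:

```python
import os
import pathlib
import subprocess
import tempfile


def backup_database(storage_config: dict, dest: str, run=subprocess.run):
    """Dump the postgres database into a temporary directory and rsync the dump out.
    The temporary directory vanishes when the function returns, which is why a path
    like /tmp/tmprfyv4y7t/db.psql from the logs no longer exists by the time the
    failing command is re-run by hand."""
    with tempfile.TemporaryDirectory() as tmp:
        dumpfile = pathlib.Path(tmp) / 'db.psql'
        env = os.environ.copy()
        env['PGPASSWORD'] = storage_config['database_password']
        # 1) dump the database into the short-lived temporary directory
        run(['pg_dump', '-h', storage_config['database_hostname'], '-f', str(dumpfile)],
            check=True, env=env)
        # 2) ship the dump to the (possibly remote) destination
        run(['rsync', '-azh', str(dumpfile), dest], check=True)
```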
@eimrek "...the python code just uses subprocess to..." - regarding 1), you say it uses `bash` to execute commands on the host server. Well, the host is Windows, so no `bash`. However, `mkdir` exists, so at least the folder is created, with the caveat that the path must be in quotes. As for 2), see my previous comment regarding getting `rsync` to work on Windows. `rsync` did work manually on a random file: moved it from my container to the Windows drive. But I still can't get it to work as part of `verdi storage backup`. I'll try again later today. I'll also check the mount approach, which would circumvent the cross-platform `rsync`ing issue.
> Another solution is to mount a shared folder from Windows onto the Linux server (container?) and `rsync` to the shared folder. If this works, I would likely need to write a script to `rsync` to the shared folder, `ssh` to the Empa workstation, move the backup from the shared folder over to the dedicated project drive, and clear out the shared folder. Not sure if `verdi storage backup` would have any issues here.
Tried this ☝️

Steps:

- Start a container from Windows Docker Desktop with a mounted Windows drive (non-WSL) pointing to the container's `/home/jovyan/backup`
- Install dependencies
  - Clone `aiida-core` -> checkout `backup-script` -> install
  - Clone `disk-objectstore` -> checkout `backup-cmd` -> install
- Install `rsync` on the container
  - Had to connect to the container as `root` to do this (only for this step; remaining steps executed as `jovyan`)
  - See issue #411 on aiidalab-docker-stack
- Run `verdi storage backup --pg_dump_exe /opt/conda/envs/aiida-core-services/bin/pg_dump /home/jovyan/backup`
Results:

- Backup produced no issues 🎉
- Backup accessible via Windows 🎉🎉
- Follow-up backup uses previous backup correctly to facilitate incremental backups 🎉🎉🎉
@eimrek not sure what the issue was in your case that raised symlink issues 🤷🏻♂️
Thanks @sphuber for the review. I accidentally did the rebase before implementing your changes, so I couldn't directly commit via GitHub, but I implemented everything manually. The functionality seems to work. Currently there are no tests for this in aiida-core, but at least some basic ones should be added, right?
> Currently there are no tests for this in aiida-core, but at least some basic ones should be added, right?
Thanks @eimrek. I agree, we should definitely have some tests. I would add a test for the `PsqlDosStorage.backup` method directly, testing that the backup actually works. The CLI tests can then be very light, just checking that the parameters work (that reminds me, it might be best to add as much validation as possible in the CLI itself, for example that `keep` is a positive integer greater than 1). We don't have to check the "correctness" of the actual backup again, as that is already done elsewhere.
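Such CLI-side validation could take roughly this shape (a sketch: the option name `--keep` and the exact lower bound are whatever the PR settles on; in `verdi` this would be a click callback raising `click.BadParameter` rather than `ValueError`):

```python
def validate_keep(value):
    """Validate the hypothetical --keep option: the number of previous backups
    to retain must be a positive integer."""
    # bool is a subclass of int, so reject it explicitly
    if not isinstance(value, int) or isinstance(value, bool) or value < 1:
        raise ValueError(f'--keep must be a positive integer, got {value!r}')
    return value
```

Validating at the CLI boundary means the storage backend can assume a sane value and the user gets the error before any backup work starts.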
One thing that might be good is to check if the profile names match for a backup and any existing backups in the destination.
@sphuber I added a very basic pytest that seems to work, but I'm not fully sure how the storage backend is determined in this case. It probably should be modified somehow so that `psql_dos` is selected.
Note: for the other tests to pass, we should make a release of `disk-objectstore` and update the dependency.
> @sphuber I added a very basic pytest that seems to work, but I'm not fully sure how the storage backend is determined in this case. It probably should be modified somehow so that `psql_dos` is selected.
Thanks. It just takes the backend of the currently loaded profile for the test suite. If you want to force a particular storage backend, you would have to create a profile with that storage and load it. But I would argue that is not the purpose of this test. This test does not test that the backup functionality of any storage plugin works; that should be done in tests specific to that storage backend. So I think this test is just fine.
That being said, it would be good to add a test somewhere in `tests/storage/psql_dos` that explicitly tests the backup functionality of the `PsqlDosStorage` class. This can just go through the Python API of course.
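The shape of such a test could be roughly the following. This is a self-contained sketch: `FakeStorage` is a hypothetical stand-in for the real `PsqlDosStorage`, which in `tests/storage/psql_dos` would come from the loaded test profile instead.

```python
import pathlib


class FakeStorage:
    """Stand-in mimicking what a real backup should leave behind:
    a database dump plus a copy of the disk-objectstore container."""

    def backup(self, dest: pathlib.Path):
        dest.mkdir(parents=True, exist_ok=True)
        (dest / 'db.psql').write_text('-- database dump --')
        (dest / 'container').mkdir(exist_ok=True)


def test_backup(tmp_path: pathlib.Path):
    """With pytest, tmp_path is the auto-provided temporary directory fixture."""
    storage = FakeStorage()
    dest = tmp_path / 'backup'
    storage.backup(dest)
    # The backup directory should contain the database dump and the container copy.
    assert (dest / 'db.psql').is_file()
    assert (dest / 'container').is_dir()
```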
> Note: for the other tests to pass, we should make a release of `disk-objectstore` and update the dependency.
I was hoping to hold off with this while we tested its integration in `aiida-core`. For the time being, you can install from the `main` branch of the repo. You simply replace the `disk-objectstore` line in `pyproject.toml` with `disk-objectstore@git+https://github.com/aiidateam/disk-objectstore` and do the same in the `requirements/*.txt` files.
I adapted the backup docs such that it mentions the new command, but also keeps the "manual instructions".
Additionally, I tested restoring from a backup for a profile with 35k nodes (around 170 MB total size) to a different machine and checked that

- `verdi storage info` matches;
- some basic queries match (e.g. all StructureData nodes).
@edan-bainglass feel free to test on your system & check the updated docs to see if they're easy to follow.
One small doubt I have is whether the backup instructions should be under the topic "How to manage your installation", as it might not be so easy to find.
I automatically created a pull request (#) that adapts the requirements/ files according to the dependencies specified in the 'pyproject.toml' file.
```
aiida/storage/psql_dos/backend.py:docstring of aiida.storage.psql_dos.backend.PsqlDosBackend._backup:1: WARNING: py:class reference target not found: disk_objectstore.backup_utils.BackupManager
```

This is why the docs build is failing.
> ```
> aiida/storage/psql_dos/backend.py:docstring of aiida.storage.psql_dos.backend.PsqlDosBackend._backup:1: WARNING: py:class reference target not found: disk_objectstore.backup_utils.BackupManager
> ```
>
> This is why the docs build is failing.
Thanks Jason. This is because we are temporarily installing disk-objectstore from the repo, but this is not done for the docs, so it does not have the latest version. When we release it and install from PyPI, then the docs should also have this reference.
Thanks both, I added the file to the docs nitpick-exceptions and the docs are building nicely.
From further discussion with @sphuber, it seems it would still be better to remove the optional arguments `--rsync-exe` and `--pg-dump-exe`. The latter is psql_dos-specific and doesn't really fit the plugin/modularity philosophy, and if we require executables in PATH anyway, we might as well assume that rsync is in PATH as well.

Regarding how the user knows what executables are needed in PATH, I think we can just leave it to the storage backend to check and notify the user if something is missing and how to set it up.
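Such a check can be as small as this (a sketch; the required executable names are examples of what a psql_dos-style backend might need):

```python
import shutil


def missing_executables(required):
    """Return the names in `required` that are not found on PATH, so the storage
    backend can tell the user exactly what to install before starting a backup."""
    return [name for name in required if shutil.which(name) is None]


# e.g. a backend could verify its prerequisites up front:
missing = missing_executables(['rsync', 'pg_dump'])
if missing:
    print(f"Backup requires the following executables on PATH: {', '.join(missing)}")
```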
Initially, the `--pg-dump-exe` argument was mentioned by (I think) @superstar54 and @giovannipizzi, as you had a specific need for it in some AiiDAlab environment, right? Is the need still there, or can this be done through PATH?