V2.2.0 convenient download scripts
Hi,
I tried to add a few features to the download scripts as a remedy for some potentially annoying issues that cause tickets, and to speed up the download and uncompress processes.
Specifically:
- when using rsync on a multi-user system (e.g. an HPC cluster), some sites choose not to allow direct internet access, but force the use of a proxy on head nodes. In this case, rsync will run into a timeout. With this pull request, this case is caught and a hint regarding the `RSYNC_PROXY` variable is printed. (might save you some issue tickets)
- as downloads might take quite some time, errors cannot be ruled out. In this case, users are forced to make a new attempt. With this pull request, users are asked whether they want to proceed. If yes, the scripts will remove the `ROOT_DIR` first. Else, the scripts will cowardly refuse to proceed. Why? Because a download script triggered in error would otherwise start operating again. (might yield some less-annoyed users)
- some files are rather large, so when `pigz` is found in `PATH`, uncompressing with `pigz` is attempted. The parallelism is NOT in the decompression itself; however, as the file handling is separated from the decompression step, a minor speed-up can be achieved.
- particularly, in the uncompress step of the mmCIF download script, there is a line like `find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +`, which takes ages to complete. Here, switching to `find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | xargs -0 -P2 "${uncompress_cmd}"` yields a speed-up of about a factor of 2. The hardcoded `-P2` is a bit unfortunate, yet I do not know whether it makes sense to figure out what parallelism is allowed for the user (e.g. reading the number of processors, reading the cgroup, and taking the minimum), because much will depend on the file system and the load it is currently under.
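The pieces described above could be sketched roughly as follows. `RAW_DIR` and the find/xargs pipeline follow the text; the function names, the proxy hint wording, and the exact integration points are my assumptions, not the actual patch:

```shell
#!/bin/bash
# Sketch of the proposed download-script additions (assumed structure).
set -euo pipefail

# Prefer pigz when it is in PATH; fall back to plain gunzip otherwise.
choose_uncompress_cmd() {
  if command -v pigz >/dev/null 2>&1; then
    echo "pigz -d"
  else
    echo "gunzip"
  fi
}

# Wrap rsync so that a failure (e.g. a timeout behind an HPC head-node
# proxy) prints a hint about the RSYNC_PROXY variable.
run_rsync() {
  if ! rsync "$@"; then
    echo "rsync failed. If your site requires a proxy, consider setting" >&2
    echo "RSYNC_PROXY, e.g.: export RSYNC_PROXY=proxy.example.org:3128" >&2
    return 1
  fi
}

# Decompress all .gz files below a directory with two parallel workers,
# as in the proposed replacement for the '-exec gunzip {} +' line.
parallel_gunzip() {
  local raw_dir="$1" cmd
  cmd=$(choose_uncompress_cmd)
  # shellcheck disable=SC2086  # word splitting of "pigz -d" is intended
  find "${raw_dir}/" -type f -iname "*.gz" -print0 \
    | xargs -0 -r -P2 ${cmd}
}
```

The `-r` flag (GNU xargs) avoids invoking the decompressor with no arguments when no `.gz` files are found; without it, `gunzip` would block waiting on stdin.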
Your comments are most appreciated. I hope that you find my contribution worth considering.
Best regards, Christian Meesters
In the last two commits: a user report made me notice that only the executing user had read permissions. This, of course, needed to be fixed for a multi-user system.
Essentially, I did `chmod 444`; this, however, is questionable. If the database should be versioned, which could be accomplished with a grep on a versioned setup script (see #447; a `git describe --tags --abbrev=0` is not possible for non-git directories, as in extracted release downloads), then read-only files should be ensured. Else, they need to be user-writable, too.
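For illustration, the two permission schemes under discussion could look like this. This is a hedged sketch: the function names and the directory argument are hypothetical, not part of the actual scripts. Directories need the execute bit in addition to read, or they cannot be traversed:

```shell
#!/bin/bash
set -euo pipefail

# Option 1: strictly read-only for everyone (the chmod 444 approach).
# Suitable if the database tree is treated as a fixed, versioned release.
make_read_only() {
  find "$1" -type f -exec chmod 444 {} +
  find "$1" -type d -exec chmod 555 {} +
}

# Option 2: world-readable but owner-writable, so the owner can still
# update or re-download files in place.
make_owner_writable() {
  find "$1" -type f -exec chmod 644 {} +
  find "$1" -type d -exec chmod 755 {} +
}
```

Either way, other users on the system gain read access; the difference is only whether the owning account keeps write permission.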
Opinions?
Hi,
I hoped to at least spark a bit of a discussion, as the mentioned issues still persist for multi-user systems. Whether the work with `pigz` is appreciated is admittedly perhaps not worth a discussion. Yet, downloading the db and getting the user permissions right should not be an issue, right?