
V2.2.0: convenient download scripts

Open cmeesters opened this issue 2 years ago • 1 comment

Hi,

I tried to add a few features to the download scripts, both to remedy some potentially annoying issues that cause support tickets and to speed up the download and uncompress steps.

Specifically:

  • when using rsync on a multi-user system (e.g. an HPC cluster), some sites choose not to allow direct internet access and instead force the use of a proxy on head nodes. In this case, rsync will run into a timeout. With this pull request, that case is caught and a hint about setting the `RSYNC_PROXY` variable is printed. (might save you some issue tickets)
  • as downloads might take quite some time, errors cannot be ruled out, and users are then forced to make a new attempt. With this pull request, users are asked whether they want to proceed. If yes, the scripts remove the `ROOT_DIR` first; otherwise, the scripts cowardly refuse to proceed. Why? Because a download script triggered in error would otherwise simply start operating again. (might yield some less-annoyed users)
  • some files are rather large, so when pigz is found in `PATH`, uncompressing with pigz is attempted. The decompression itself is NOT parallel; however, since pigz separates the file handling from the decompression step, a minor speed-up can still be achieved.
  • in particular, the uncompress step of the mmCIF download script contains a line like `find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +`, which takes ages to complete. Switching to `find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | xargs -0 -P2 "${uncompress_cmd}"` yields a speed-up of about a factor of 2. The hardcoded `-P2` is a bit unfortunate, yet I do not know whether it makes sense to figure out what parallelism is available to the user (e.g. reading the number of processors, reading the cgroup, taking the minimum), because much will depend on the file system and the load it is currently under.

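A minimal sketch of how the proxy hint could look. The function name and message wording are my own inventions, not necessarily the PR's code; the exit codes are taken from rsync(1) (30 = timeout in data send/receive, 35 = timeout waiting for daemon connection):

```shell
#!/bin/bash
# Print a hint about RSYNC_PROXY when rsync exits with a timeout code.
# Per rsync(1): 30 = timeout in data send/receive,
#               35 = timeout waiting for daemon connection.
hint_on_rsync_timeout() {
  local status="$1"
  if [ "${status}" -eq 30 ] || [ "${status}" -eq 35 ]; then
    cat >&2 <<'EOF'
rsync timed out. If your site only allows internet access via a proxy,
set RSYNC_PROXY before re-running, e.g.:
  export RSYNC_PROXY=proxy.example.org:8080
EOF
  fi
  return "${status}"
}

# Usage sketch: rsync ...  ; hint_on_rsync_timeout $?
```

The exit status is passed through, so the caller's error handling still sees the original rsync failure.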
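The re-download confirmation could be sketched like this (the function and variable names are assumptions for illustration, not the PR's exact code):

```shell
#!/bin/bash
# Ask before wiping an existing (possibly half-downloaded) ROOT_DIR.
confirm_fresh_start() {
  local root_dir="$1"
  [ -d "${root_dir}" ] || return 0   # nothing to clean up, proceed
  local answer
  read -r -p "${root_dir} exists; delete it and download again? [y/N] " answer
  case "${answer}" in
    [Yy]*) rm -rf "${root_dir}" ;;   # user confirmed: start from scratch
    *) echo "Cowardly refusing to proceed." >&2
       return 1 ;;                   # a stray invocation stops here
  esac
}
```

Defaulting to "No" means a download script triggered by accident leaves existing data untouched.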
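The pigz fallback and the parallel decompression could look roughly like this. This is a sketch on demo data, not the actual script: the real scripts operate on the downloaded `RAW_DIR`, and I add `-n1` here so xargs hands one file to each invocation rather than batching everything into one:

```shell
#!/bin/bash
set -euo pipefail

# Prefer pigz when it is on PATH; fall back to plain gunzip otherwise.
if command -v pigz >/dev/null 2>&1; then
  uncompress_cmd="pigz -d"
else
  uncompress_cmd="gunzip"
fi

# Demo stand-in for the real RAW_DIR of downloaded *.gz files.
RAW_DIR="./raw_demo"
mkdir -p "${RAW_DIR}"
printf 'demo\n' > "${RAW_DIR}/file.cif"
gzip -f "${RAW_DIR}/file.cif"

# -print0/-0 keeps odd filenames safe; -P2 runs two workers in parallel.
# ${uncompress_cmd} is deliberately unquoted so "pigz -d" splits into
# command and flag.
find "${RAW_DIR}/" -type f -iname "*.gz" -print0 \
  | xargs -0 -r -n1 -P2 ${uncompress_cmd}
```

With only a handful of very large files, `-n1` matters: without it, xargs may pack all files into a single invocation and the second worker sits idle.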
Your comments are most appreciated. I hope that you find my contribution worth considering.

Best regards Christian Meesters

cmeesters avatar Apr 27 '22 06:04 cmeesters

In the last two commits: a user report made me notice that only the executing user had read permissions on the downloaded files. This, of course, needed to be fixed for a multi-user system.

Essentially, I did `chmod 444`; this, however, is questionable. If the database is to be versioned, which could be accomplished with a grep on a versioned setup script (see #447; a `git describe --tags --abbrev=0` is not possible for non-git directories such as extracted release downloads), then read-only files should be ensured. Otherwise, they need to be user-writable, too.
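For the permission fix, a sketch of what "readable by everyone, writable by no one" could look like on a demo tree (the paths are illustrative, and whether `444` is the right choice is exactly the open question above):

```shell
#!/bin/bash
set -euo pipefail

# Demo stand-in for the downloaded database tree.
ROOT_DIR="./db_demo"
mkdir -p "${ROOT_DIR}/params"
printf 'weights\n' > "${ROOT_DIR}/params/model.npz"

# Files read-only for everyone (the questionable chmod 444);
# directories must stay traversable for other users, hence 755.
find "${ROOT_DIR}" -type f -exec chmod 444 {} +
find "${ROOT_DIR}" -type d -exec chmod 755 {} +
```

Note that the directories need the execute bit, otherwise other users cannot even enter the tree to read the world-readable files.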

Opinions?

cmeesters avatar Apr 27 '22 10:04 cmeesters

Hi,

I hoped to at least spark a bit of a discussion, as the mentioned issues still persist for multi-user systems. Whether the work with pigz is appreciated is admittedly perhaps not worth a discussion. Yet downloading the database and getting the user permissions right should not be an issue, right?

cmeesters avatar Feb 01 '23 09:02 cmeesters