A tutorial for using git-annex to share large datasets and code within the Lab.

git-annex tutorial: sharing data and code in the Lab and outside

In this tutorial we introduce git-annex and its use within a research laboratory to help share data and code among lab members, external collaborators and anonymous users. git-annex is a software tool that extends the better-known git in convenient ways when dealing with large files and large repositories. In the following, we introduce some basic concepts and then describe the scenario and the workflow that we implemented in our Lab which, we believe, can be useful to other people in a similar setting. git and git-annex require some dedication before they can be used fruitfully. During that process, it is common to make mistakes, as it was for us. For this reason, in this tutorial, we also describe common errors and how to recover from them. Note that basic familiarity with git is assumed as a prerequisite for this tutorial.

The tutorial is structured as follows. First we describe the scenario in which git-annex is used. Then, we provide some preliminary information about what git-annex is and some additional technical details. After that, we describe how to set up a centralized repository that will host a copy of all data. This is the main part of the tutorial, in which we describe how to make the repository easily accessible from a web server and from GitHub. The last part of the tutorial describes the use of the repository from the point of view of standard users, who just need to access the data and get updates, as well as of content creators, i.e. those having the rights to add new content to the repository remotely.

The Lab scenario

In our lab, we have large datasets (terabytes of data) which comprise many large files, together with the code that generated part of the data. Such datasets are kept on a storage server. Small portions of the data are frequently needed on the local desktop and laptop computers of lab members and collaborators, for processing and analysis. Moreover, new data are frequently generated by lab members, on local computers, by further processing of available data. One main aim is to share such new data and the code with others. Additionally, the data already shared are not static: from time to time, code is updated or bugs are fixed, so some of the preprocessed data is re-generated and shared, replacing the previous version. In such a setting, it is important for lab members and collaborators to get updates of data and code in a simple way.

What is git-annex?

Simply put, git-annex is an extension of git that provides some extra functionalities:

  • Large files in the repository are not locally copied, when cloning or fetching/pulling. Of course, they can be retrieved on request. Additionally, local copies of large files can be removed to free some space.
  • git-annex keeps track of how many copies of each file exist and where they are.
  • TODO

Here, we do not describe git, other than noting that it is, to our knowledge, the most used version control system. There is an enormous amount of information already available about git. git helps keep track of updates to files and supports collaborative work among multiple users. Unfortunately, git does not provide native support for handling large files in a convenient way. That is what git-annex adds to it.

Versions and other technical details

In this tutorial we refer to git-annex version 6.20171211, on GNU/Linux machines using Ubuntu 16.04. As of this version of git-annex, the default format of the repository is v5. In future, we plan to upgrade to v6, following the default settings of git-annex. At that time, we plan to update the parts of this tutorial that are affected by this change.

When issuing git-annex from the command line, two alternative ways can be used, either git-annex or git annex. To our knowledge, there is no difference between them.

Until v5 of the repository format, git-annex uses certain filesystem features that may not be available on all filesystems, like symbolic links and FIFOs. For example, the FAT filesystem does not provide them. When initializing the repository with git annex init (see below for further details), a clear warning will appear on the screen, in case you are using such crippled filesystems. Nevertheless, git-annex has ways to (partially) address such problems. In this tutorial, we do not discuss such issues and we assume that a non-crippled filesystem is available, like the EXT4 filesystem, default on GNU/Linux systems.
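As a rough manual check of the two filesystem features mentioned above, the following sketch tries to create a symbolic link and a FIFO inside a given directory. It is only an illustration and not the probe that git-annex itself runs during git annex init; the function name probe_fs is hypothetical.

```shell
# Hypothetical helper: check whether a directory's filesystem accepts
# symbolic links and FIFOs. Not git-annex's own test, just a manual probe.
probe_fs() {
    dir=${1:-.}
    probe=$(mktemp -d "$dir/probe.XXXXXX") || return 1
    touch "$probe/target"
    if ln -s target "$probe/link" 2>/dev/null && [ -L "$probe/link" ]; then
        symlinks=yes
    else
        symlinks=no
    fi
    if mkfifo "$probe/fifo" 2>/dev/null && [ -p "$probe/fifo" ]; then
        fifos=yes
    else
        fifos=no
    fi
    rm -rf "$probe"
    echo "symbolic links: $symlinks, FIFOs: $fifos"
}

# Example: probe a freshly created scratch directory.
scratch=$(mktemp -d)
probe_fs "$scratch"
```

Running probe_fs on the directory that will host the repository, before git annex init, gives an early hint of whether the warning about crippled filesystems is likely to appear.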

Alternatives to git-annex

TODO

Setting-up a centralized repository

In the following example, we create a directory /labdata on a storage-server, where we store a copy of all the data with git and git-annex, so that they can be shared with lab members and external collaborators. The repository hosts both the git database, in /labdata/.git/, and a copy of the actual files and directories, the working tree, in /labdata/, for easy browsing.

Permission to add or modify the data in the repository is enforced through filesystem permissions by creating a group of users, named dataowners. Everyone else can (only) read the data in the repository.

Here we describe the step-by-step procedure to create the repository from scratch, with example commands followed by their detailed explanation:

cd /
mkdir labdata
addgroup dataowners
adduser contributor dataowners
chgrp dataowners labdata
chmod g+rwx labdata
chmod o+rx-w labdata
chmod g+s labdata
cd labdata

This first group of commands creates the directory that hosts the repository, /labdata, creates a new system group dataowners and assigns that group to /labdata, with write permissions. The user contributor is also added to that group, and others may be added in the same way. Additionally, read (r), write (w) and access (x) permissions are granted to the group (g+rwx) and read and access (but not write) permissions are granted to everyone else (o+rx-w). Finally, the setgid permission is enabled for the group (g+s), so that all future files and directories created inside /labdata will automatically inherit the group dataowners, and new directories will also inherit the setgid bit.
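The effect of the permission bits can be tried without root privileges in a scratch directory. The sketch below is an assumption-laden illustration: it skips addgroup, adduser and chgrp (which need root), uses a temporary directory instead of /labdata, and relies on stat -c from GNU coreutils, as found on Ubuntu.

```shell
# Scratch copy of the permission steps; the paths here are illustrative only.
demo=$(mktemp -d)
mkdir "$demo/labdata"
chmod g+rwx "$demo/labdata"     # group: read, write, access
chmod o+rx-w "$demo/labdata"    # others: read and access, no write
chmod g+s "$demo/labdata"       # setgid: new entries inherit the group

# A directory created afterwards should carry the setgid bit as well.
mkdir "$demo/labdata/subdir"

top_mode=$(stat -c '%A' "$demo/labdata")
sub_mode=$(stat -c '%A' "$demo/labdata/subdir")
echo "labdata: $top_mode"
echo "subdir:  $sub_mode"
```

The s (or S) in the group part of the mode string is the setgid bit; the remaining bits of subdir depend on the umask of the user.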

git init --shared=group
git annex init storage-server

This second group of commands creates the git repository and the additional git-annex part of it. Notice that, the git-annex part of the repository can only be initialized within an existing git repository. In order to let the repository be group-writable and accessible to everyone, the initialization of the git repository requires --shared=group. This will properly set permissions within /labdata/.git/. The initialization of git-annex creates a /labdata/.git/annex/ directory, called the annex, where git-annex stores all its information. To conclude, we added the optional storage-server description when initializing the git-annex part of the repository. This is convenient to set a desired human-readable label to the repository.
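What --shared=group records can be inspected in a throwaway repository. This is a minimal sketch using plain git only (no git-annex), a temporary directory instead of /labdata, and stat -c from GNU coreutils; all names are illustrative.

```shell
# Throwaway repository to look at the effect of --shared=group.
repo=$(mktemp -d)
git init -q --shared=group "$repo"

# The choice is stored in the repository configuration...
shared=$(git -C "$repo" config core.sharedRepository)
# ...and reflected in the permissions of the .git directory.
gitdir_mode=$(stat -c '%A' "$repo/.git")

echo "core.sharedRepository=$shared"
echo ".git mode: $gitdir_mode"
```

The group part of the .git mode should show write permission, which is what lets members of dataowners update the repository.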

At this point, content/changes can be added to the repository in two main ways:

  • Directly on the storage-server, by copying files and directories in /labdata and then:
    • either via git annex add <file> and git commit -m <message>. In this case, the file is added to the annex, i.e. moved to /labdata/.git/annex/objects/, set read-only, renamed according to its checksum and a symbolic link pointing to it is created in the original location of the file. Only the symbolic link is added to the git repository, while git-annex keeps track of the content. From the user perspective, the initial file is still accessible, through the link, in read-only mode. Notice that, when cloning this repository, only the symbolic link of this file will be present and not its content, unless explicitly requested.
    • Or via git add <file> and git commit -m <message>. In this case the file is added to the git repository and not to the annex. Notice that, when cloning this repository, a copy of this file will be present, as always with git.
  • From remote repositories, through git push or git annex sync. In this second case, the repository must be configured properly, as explained below.

Whether to use git annex add <file> or git add <file> can be decided for each file individually, and depends on the purpose of the file and of the repository. Typically, code should be added via git add <file> and data via git annex add <file>. Nevertheless, it is possible to use git annex add <file> for everything. If, at a later stage, a file needs to be moved from the git repository to the annex, or vice versa, see the section "Moving content from git-annex to git and viceversa" below.
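The rule of thumb above (code via git add, data via git annex add) can be written down as a small helper. The sketch below is hypothetical: the function name suggest_add and the list of extensions treated as code are assumptions to be adapted to each repository, and the helper only prints the suggested command without running it.

```shell
# Hypothetical helper: print the command suggested by the rule of thumb.
# The extensions listed as "code" are an assumption, adjust as needed.
suggest_add() {
    case "$1" in
        *.py|*.sh|*.m|*.R|*.c|*.md|*.txt) echo "git add $1" ;;
        *) echo "git annex add $1" ;;
    esac
}

suggest_add analysis.py
suggest_add recording.dat
```

Printing the command rather than executing it keeps the decision visible and easy to override for individual files.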

Here follows an example transcript of what happens when executing git annex add <file> on a file foo present in the repository:

> ls -al
total 16
drwxrwsr-x  3 ele  dataowners 4096 dic 26 16:19 .
drwxr-xr-x 26 root root       4096 dic 26 16:13 ..
-rw-rw-r--  1 ele  dataowners    4 dic 26 16:19 foo
drwxrwsr-x  9 ele  dataowners 4096 dic 26 16:18 .git
> git annex add foo
> ls -al
total 16
drwxrwsr-x  3 ele  dataowners 4096 dic 26 16:21 .
drwxr-xr-x 26 root root       4096 dic 26 16:13 ..
lrwxrwxrwx  1 ele  dataowners  178 dic 26 16:19 foo -> .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
drwxrwsr-x  9 ele  dataowners 4096 dic 26 16:21 .git
> git commit -m "added foo"
[master (root-commit) 3e461c6] added foo
 1 file changed, 1 insertion(+)
 create mode 120000 foo
> ls -al .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
-rw-rw-rw- 1 ele dataowners 4 dic 26 16:19 .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
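The name of the object in the transcript follows the git-annex key layout: a backend name, the file size prefixed by s, and the checksum, separated by - and --. As an illustration, the following sketch splits the key from the transcript into these parts with shell parameter expansion; the variable names are arbitrary.

```shell
# Key copied from the transcript above.
key="SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730"

backend=${key%%-*}      # text before the first "-"
rest=${key#*-}          # everything after the first "-"
size=${rest%%--*}       # "s" followed by the size in bytes
size=${size#s}          # drop the leading "s"
checksum=${rest#*--}    # text after the "--" separator

echo "backend:  $backend"
echo "size:     $size bytes"
echo "checksum: $checksum"
```

For a file with an extension, the E variant of the backend appends the extension after the checksum, so the last field would then need one more split.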

Branches created by git-annex

TODO: explain the branches below

> git branch -a
  git-annex
* master
  synced/master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master
  remotes/origin/synced/git-annex
  remotes/origin/synced/master

Allowing content creation from remote repositories

Content and changes can be created on remote clones of the repository, i.e. local computers of lab members and collaborators. Such content and changes need to be pushed to the storage-server in order to be shared. For this reason, the storage-server needs to be properly configured to allow that, in two steps. The first is:

git config receive.denyCurrentBranch updateInstead

With this command, we allow remote users to push to the repository. Normally, this is not permitted, because the repository is non-bare, i.e. it has a working tree of files and directories, besides the .git/ database. If you do not plan to push changes remotely, then you do not need this configuration. Notice that, if you push changes to the repository after enabling the previous configuration, the working tree of the repository will not be updated. See below how to enable the automatic update of the working tree. The second step is:

cd /labdata
git annex wanted . standard
git annex group . backup

The storage-server is meant to keep copies of all files in the repository. When content is created remotely, it is very important to tell the storage-server to enforce this requirement when operations like git annex sync are performed (see TODO). To enforce this and similar behaviors, git-annex provides rich preferred-content expressions, described in the git-annex documentation on preferred content. It also offers standard groups of preferences. The commands above tell the repository to use a standard group of preferences called backup, which means "All content is wanted. Even content of old/deleted files".
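Going back to the first step, the difference made by receive.denyCurrentBranch can be observed with plain git in two throwaway repositories. This is a sketch under stated assumptions: the directory names server and clone, the file notes.txt and the demo identity are illustrative, and git-annex is not involved.

```shell
tmp=$(mktemp -d)
# Wrapper that supplies a throwaway identity for the commits.
g() { git -c user.name=demo -c user.email=demo@example.org "$@"; }

# A non-bare "server" repository with one commit, and a clone of it.
g init -q "$tmp/server"
echo v1 > "$tmp/server/notes.txt"
g -C "$tmp/server" add notes.txt
g -C "$tmp/server" commit -q -m "initial"
g clone -q "$tmp/server" "$tmp/clone"

# A new commit in the clone, pushed to the branch checked out on the server.
echo v2 > "$tmp/clone/notes.txt"
g -C "$tmp/clone" commit -q -am "update"
if g -C "$tmp/clone" push -q origin HEAD 2>/dev/null; then
    first=accepted
else
    first=refused
fi

# Same push after enabling the configuration on the server.
g -C "$tmp/server" config receive.denyCurrentBranch updateInstead
if g -C "$tmp/clone" push -q origin HEAD 2>/dev/null; then
    second=accepted
else
    second=refused
fi

echo "default: $first, updateInstead: $second"
```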

More on standard groups

If you want to check whether standard groups are enabled in the repository, you just need to use the commands above, without specifying standard and backup. The following transcript shows an example:

> git annex wanted .
standard
> git annex group .
backup

Notice that you can set multiple standard groups, whose effect is left as exercise to the reader. Continuing the previous example:

> git annex group . client
group . ok
(recording state in git...)
> git annex group .
client backup

If you added a standard group by mistake and want to remove it, you need to use git annex ungroup, as here:

> git annex ungroup . client
> git annex group .
backup

Adding public accessibility from the web

Information on how to access the repository when the storage server directory with the data is exposed via web server.

Basically, git update-server-info should be executed whenever the repository is remotely updated, e.g. via push, or after a local commit. In order to do that automatically, git hooks must be enabled:

cd /labdata
mv .git/hooks/post-update.sample .git/hooks/post-update
cp -a .git/hooks/post-update .git/hooks/post-commit

Warning: hooks need to be executable.
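The hook setup can be tried in a throwaway repository. In the sketch below the hook is written out explicitly instead of renaming post-update.sample, in case the sample files are not installed; the repository path, the file readme.txt and the demo identity are illustrative assumptions.

```shell
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/.git/hooks"

# Hook that refreshes the files needed when serving the repository over HTTP.
cat > "$repo/.git/hooks/post-update" <<'EOF'
#!/bin/sh
exec git update-server-info
EOF
cp -a "$repo/.git/hooks/post-update" "$repo/.git/hooks/post-commit"
chmod +x "$repo/.git/hooks/post-update" "$repo/.git/hooks/post-commit"

# A local commit triggers post-commit, which should write .git/info/refs.
echo hello > "$repo/readme.txt"
git -C "$repo" add readme.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.org commit -q -m "first"

ls -l "$repo/.git/hooks/post-commit" "$repo/.git/info/refs"
```

If .git/info/refs is missing after the commit, the usual cause is a hook without the executable bit.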

Add new special remote via http.

git remote add httpdata HTTPURL/.git
git annex initremote datasrc type=git location=HTTPURL/.git autoenable=true
git annex merge  # necessary?
git remote rm httpdata  # not needed anymore

Note: after pulling/syncing in remote clones, git annex init should be re-run, according to the man page. Maybe it is necessary to run git annex enableremote datasrc on the user computer. TODO.

Publishing the repository on github.com

TODO

The idea is to keep a copy of the repository on GitHub, without the contents of the annex, so that it is more visible and can be easily cloned by anonymous users. Moreover, it can be set up so that content can be retrieved via git annex get <file>, leveraging the access to the storage-server and/or the public access via the web.

....create repository on github....
git remote add github <github-URL-to-repository>
git push -u github master
git push -u github git-annex

Moving content from git-annex to git and viceversa

After populating and using the repository, it is common to realize that it may not be smart to have all files stored with git-annex and that it would be better to have some of them simply stored in git. The following commands migrate files from git-annex to git:

git annex unannex <file>
git add <file>
git commit -m <message>

Notice that git annex unannex <file> does not need a commit.

Viceversa: TODO.

Problems with permissions when pushing/syncing

TODO

git-annex for users

In this section, we describe the use of git-annex from the point of view of users, when the centralized repository is already available. We make a distinction between users who just access the repository to obtain the data and, from time to time, the updates, and users who contribute to the repository by creating new content or code to be sent to the central repository.

As a user, the first action is to clone the repository hosted on the storage server. Notice that the repository may be reached in several ways: via SSH, if you have an account on the storage server, via HTTP, if the repository has been published with a web server, or via GitHub, if this option has been set up. In this last case, the content of the files in the annex is not available from GitHub itself, and at least one of the other means should be available to reach the content.

git clone user@storage-server:/labdata

The directory labdata/ is then created, with the whole tree of directories and either symbolic links to the (missing) content of the files, if they had been added with git annex add <file>, or the actual files, if added with git add <file>. Additionally, as in every git repository, the labdata/.git directory is present, hosting all the git history and internal files. Notice that the directory labdata/.git/annex, created by git-annex, is not present yet. Still, the information that git-annex needs to retrieve the content of the files in the annex is already available, because it is stored in the git-annex branch. The list of all available branches shows it:

> git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master
  remotes/origin/synced/git-annex
  remotes/origin/synced/master

For this reason, the content of the files currently appearing just as broken links can be easily retrieved with:

git annex get <file>

where <file> is a filename, a directory, or an expression with wildcards that address the content we require.

From time to time, the user can retrieve updates of the repository by executing:

git pull

The user can also ask git-annex information on where to find the content of a given file:

git annex whereis <file>

data/code contributors

If a user is also a contributor to the repository, then he/she can create new content and push it to the repository on the storage server. In order to do that, some additional steps should be done on the local clone of the repository. For clarity, the following instructions start from cloning the repository as contributor:

git clone contributor@storage-server:/labdata
cd labdata/
git annex init contributor1-desktop

Here, git annex init <label> is not mandatory, but it is good practice for a collaborator to add a human-readable label describing the local repository, because it will show up in the information stored by git-annex and shared with others.

A second important step is to inform git-annex that the local repository should only get the content explicitly requested by the collaborator. This is important when, later, the contributor sends new content to the main repository on the storage server with git annex sync. git-annex provides a rich and flexible set of expressions to set the preferences for the content automatically retrieved during certain operations. See the section "Allowing content creation from remote repositories" above for a more detailed explanation. Here, the main step is to set the content preferences of the local repository to a standard group called manual, meaning that content will only be retrieved manually by the contributor via git annex get <file> and removed manually, when needed, with git annex drop <file>:

cd labdata/
git annex wanted . standard
git annex group . manual

At this point, the contributor can create new files and add them to the annex, via git annex add <file> and commit that:

...creating new files...
git annex add <newfiles>
git commit -m "created <newfiles>"

At this point, local changes can be sent to the repository on the storage server with git push but the content will not be sent, only the symbolic link and some metadata. In order to copy the content of the file to the annex on the storage server, git-annex provides the command git annex copy <newfiles> --to=origin. Notice that, instead of indicating the specific filename, it is sufficient to indicate the name of the directory with the new files, when copying, and git annex will figure out what content will need to be copied, e.g.:

git annex copy . --to=origin

Since pushing content to a repository often requires pulling and merging changes first, git-annex provides a more convenient way to perform all these operations, through the sync command:

git annex sync origin --content

Internally, git annex sync --content performs the following steps:

  1. git commit
  2. git pull
  3. git merge
  4. git push
  5. git annex copy . --to=origin
  6. git annex get .

Notice that the last two steps will be skipped if --content is omitted. Moreover, had the standard group manual not been set in the local repository, all files available on the storage server would have been copied locally. Anyway, if that happens, interrupting the retrieval with CTRL-C is safe.
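The list of steps can also be written as a dry-run helper that prints the manual commands without executing them. This is a sketch that simply mirrors the steps listed above, not the actual implementation of git annex sync; the function name manual_sync_steps and its arguments are hypothetical.

```shell
# Hypothetical dry-run helper: prints the manual commands, runs nothing.
manual_sync_steps() {
    remote=$1
    with_content=$2      # "yes" mimics --content, anything else omits it
    echo "git commit"
    echo "git pull $remote"
    echo "git merge"
    echo "git push $remote"
    if [ "$with_content" = "yes" ]; then
        echo "git annex copy . --to=$remote"
        echo "git annex get ."
    fi
}

manual_sync_steps origin yes
```

Calling manual_sync_steps origin no prints only the first four commands, matching the behavior described when --content is omitted.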

Editing pre-existing files

If a <file> is stored in the annex and changes need to be made to it, then the file must be unlocked first:

git annex unlock <file>
...edit...edit...edit...
git annex add <file>
git commit -m "updated <file>"

Notice that git annex unlock <file> removes the symbolic link and copies the content of the file in its place, with write permission. This is a second copy of the file, because the original one is kept, write-protected, in .git/annex/objects/. After changing the file, git annex add and git commit can be performed as usual. Notice that, if you need to change a file frequently, it may be more convenient to store it with git instead of git-annex.

Issue: Changing a file without unlocking first

What happens if you attempt to edit a file without unlocking it first? Files added with git-annex appear as symbolic links in the filesystem. An application, such as an editor, should warn that you are opening a link and not a file. Secondly, the content of the file, pointed to by the link, is stored in .git/annex/objects/ and set as write-protected. This is the only copy of the content of the file in the local repository, which is why it is protected. The application attempting to write to this file should either fail, with permission denied, or clearly ask for confirmation before writing to a write-protected file, as e.g. Sublime Text 3 does. If the user insists on writing to the file and the application allows that, the internal copy kept by git-annex is damaged. With git annex fsck <file>, git-annex will first report that the local copy of the file is no longer good and will move it to .git/annex/bad/. To recover from such a situation, it is necessary to retrieve a pristine copy of the file with git annex get <file>, then unlock it, either repeat the edits or copy the file in .git/annex/bad/ over the unlocked file, and finally add and commit.

git-annex for anonymous users

git clone http://storage-server.mydomain.com/labdata
cd labdata
git annex get <files>
[...]
git pull
git annex get <files>

Acknowledgment

Thanks to Michael Hanke's post, for inspiring parts of this tutorial and showing interesting solutions.

Thanks to Yaroslav Halchenko and Michael Hanke for their continuous effort in improving and maintaining NeuroDebian which, among many other things, provides Debian/Ubuntu repositories with the latest git-annex, within the package git-annex-standalone.

Special thanks to Joey Hess, author of git-annex, for the beautiful and intriguing piece of software that sometimes teases us like a puzzle, as git does.