
Smart dataset update rsync script

Open braddockcg opened this issue 11 years ago • 10 comments

We need an update script that is a little bit smart about versions. We need a version scheme to stamp a dataset module with a version date. An update script should be able to use rsync over the network to retrieve the version dates, detect the modules available, and somehow allow the user to specify that a particular module should be updated (via rsync).

Just running rsync periodically takes too long on our large datasets with many files, and we will need to be smart when we have to update deployments.
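
Something along these lines is what I'm picturing (the mirror URL, paths, and manifest filename are all hypothetical):

```sh
#!/bin/sh
# Sketch of a version-aware update script. Each module directory
# carries a small version.json stamped with a date; we rsync only
# those stamps first, compare against our local copies, and pull a
# full module only when its stamp has changed.
MIRROR=rsync://mirror.example.org/iiab/modules/   # hypothetical mirror
LOCAL=/data/modules

# Cheap pass: fetch just the version stamps, nothing else.
rsync -a --include='*/' --include='version.json' --exclude='*' \
    "$MIRROR" /tmp/remote-stamps/

# Update only the modules whose stamp differs from ours.
for stamp in /tmp/remote-stamps/*/version.json; do
    module=$(basename "$(dirname "$stamp")")
    if ! cmp -s "$stamp" "$LOCAL/$module/version.json"; then
        echo "Updating $module ..."
        rsync -a --delete "$MIRROR$module/" "$LOCAL/$module/"
    fi
done
```

The point is that the expensive per-file scan only ever happens for the modules that the stamp comparison (or the user) actually selects.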

braddockcg avatar May 13 '13 14:05 braddockcg

Is the data download & processing automated? GNU Make compares file timestamps when it runs, so it would only re-fetch or re-process targets whose inputs have changed.

paperdigits avatar Feb 25 '14 04:02 paperdigits

This sounds like a good use case for a DVCS like Mercurial. It has been a while since I researched a case like this, but IIRC Mercurial has built-in methods and extensions for syncing workspaces (IIAB copies) against a repo and updating them from delta packages, which are smaller and faster than rsync's checksum-and-delta sets and which can be shipped via offline storage or over a network. In any case, the binary delta storage format and delta-set push/pull methods used by Mercurial are more amenable to this usage than Git's. Versions can be tagged with individual workspace (IIAB copy) labels for per-deployment tracking.
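
Something like this, if I remember the commands right (repo paths and the revision placeholder are illustrative):

```sh
# On the master repo: package everything a deployment doesn't have
# yet into a binary delta bundle. LAST_SHIPPED_REV stands in for
# whatever revision that deployment was last updated to.
cd /srv/iiab-repo
hg bundle --base LAST_SHIPPED_REV update-2014-02.hg

# Ship update-2014-02.hg on offline storage or over the network,
# then apply it on the deployment:
cd /data/iiab-copy
hg unbundle update-2014-02.hg
hg update    # bring the working copy up to the new tip
```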

Admin-DataRoads avatar Feb 28 '14 13:02 Admin-DataRoads

@JH-DataRoads Checking ~700GB into any DVCS sounds like a nightmare.

I still haven't had time to set up a test server (next week, I hope), but has anyone looked at git-annex for storing all the data? https://git-annex.branchable.com/

paperdigits avatar Feb 28 '14 17:02 paperdigits

@paperdigits I've dealt with similarly large VCS repositories in the video game industry without any trouble, primarily in Subversion. Mercurial based its on-disk formats on Subversion's, so performance should be similar. The xdelta algorithm they use in the repo is indifferent to binary vs. text content, but none of the lossy or compressed files should go in there (only raw/source files). Git is a total non-starter because it's optimized for text blobs. The receivers would just use working copies (no history) and delta packages for version updates.

Of course rsync doesn't work well with lossy or compressed files either: a small change to the source ripples through the whole compressed output, so the rolling-checksum delta finds little to reuse and effectively resends the entire file on any change. Those files can simply be sent whenever there is a timestamp difference. For online mirrors, ZFS or BTRFS are also better than rsync for mirroring, because copy-on-write snapshots can keep mirrors in sync at the block level without redundant checksum scans or resends. New mirrors should use something like BitTorrent Sync or Magnet-URI-managed downloads of repository images on the receiving end.
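
To illustrate the COW-snapshot point with ZFS (pool and dataset names made up): incremental send/receive moves only the blocks that changed between snapshots, with no per-file checksum scan on either side.

```sh
# Snapshot the dataset at each release point.
zfs snapshot tank/iiab@2014-02-01
# ... a month of dataset changes ...
zfs snapshot tank/iiab@2014-03-01

# Push only the blocks that changed between the two snapshots.
zfs send -i tank/iiab@2014-02-01 tank/iiab@2014-03-01 | \
    ssh mirror zfs receive tank/iiab
```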

Admin-DataRoads avatar Mar 01 '14 01:03 Admin-DataRoads

Git is a total non-starter because it's optimized for text blobs.

Not just git, but git-annex: a solution for checking large files into git, kind of.

paperdigits avatar Mar 03 '14 00:03 paperdigits

git-annex was mentioned to me before. I am open to learning more about it.

Keep in mind this is a very large dataset - just reading 700 GB from disk takes many hours. It may be impractical to try to detect content changes aside from relying on file timestamps.

We currently have a JSON file in each module directory containing a manually entered date to act as a version number.
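
For reference, one of those stamps currently looks something like this (the module name and field name here are illustrative; the date is the hand-edited version):

```sh
$ cat modules/wikipedia_en/version.json
{
    "version": "2014-02-15"
}
```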

braddockcg avatar Mar 03 '14 04:03 braddockcg

A brief overview of git-annex:

git-annex manages a set of symlinks, files, and file hashes. The symlinks and hashes are stored in git, while the actual file content is stored in a subdirectory of .git. git-annex provides a way to transfer binary files (via rsync) and to check their integrity (git-annex fsck).

When you run git-annex add <file>, git-annex hashes the file, moves it under .git/annex/objects, and adds a symlink to it in the working directory. You then run a regular git commit, and the symlink and hash are stored in git. Annexed files are read-only by default.
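
Roughly, the basic workflow looks like this (paths and file names are made up; double-check against the walkthrough):

```sh
# Create a repo and annex a large file.
git init /data/iiab && cd /data/iiab
git annex init "master copy"

git annex add wikipedia_en.zim    # hashes the file, moves the content
                                  # under .git/annex/, leaves a symlink
git commit -m "Add English Wikipedia module"

# On a deployment, fetch the actual content you want:
git annex get wikipedia_en.zim

# Verify annexed content against its recorded hashes:
git annex fsck
```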

The dev of git-annex is also a Debian dev, which is awesome because the latest version of git-annex is usually in wheezy-backports or jessie. He is also very responsive in the forums at http://git-annex.branchable.com/

There is a very good walkthrough on the website as well.

If this still sounds like a solution you'd want to explore, I can make a forum post at http://git-annex.branchable.com/ asking about the scenario specific to this project.

paperdigits avatar Mar 08 '14 04:03 paperdigits

Git-annex and this Mercurial feature appear to be very similar: http://mercurial.selenic.com/wiki/LargefilesExtension
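
For comparison, largefiles usage looks something like this (the file name is illustrative; the extension ships with Mercurial and just needs enabling):

```sh
# Enable the bundled extension in ~/.hgrc:
#   [extensions]
#   largefiles =

hg add --large wikipedia_en.zim   # tracked via a stand-in pointer in
                                  # history; content stored separately
hg commit -m "Add English Wikipedia module"
```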

Admin-DataRoads avatar Apr 10 '14 01:04 Admin-DataRoads

@Admin Yes, the Mercurial largefiles extension and git-annex do seem similar, but unlike largefiles, git-annex is not billed by its own project as a "feature of last resort" or a "feature you should consider avoiding."

IMHO, that is a bit unnerving.

paperdigits avatar Apr 10 '14 22:04 paperdigits

@paperdigits Mercurial devs are just wary of anything that isn't diff- or versioning-friendly (that could actually be said of VCS devs in general). They wouldn't distribute it with the base package unless it was stable. What they really mean is that you should always try to version-track the raw or source files, because no matter how big they are, they are still diff-able. Generally, projects should only generate [lossy] compressed files at build or install time, and distribute the results with standard file-distribution tools outside the VCS. Build and compression outputs intrinsically inherit the versions of their input sources, so tracking their versions separately is normally redundant.

Admin-DataRoads avatar Apr 10 '14 23:04 Admin-DataRoads