
symlinks make uploading apt repos to s3 inefficient

Open joemiller opened this issue 13 years ago • 12 comments

We are using freight to manage apt repos for the Sensu (https://github.com/sensu) project. It's a great tool, but I noticed it's not all that efficient when repos are synced to S3 for serving.

This is due to the way symlinks are used by freight.

Time permitting I would like to submit a PR to address this, but I don't have time at the moment. I figured I would drop a note here in case you had additional thoughts or suggestions.

thanks! great tool btw

joemiller avatar May 04 '12 02:05 joemiller

I'm running into this today as well. I'd love to have a good fix, but I'm not sure exactly what it would be yet.

stahnma avatar Oct 02 '12 23:10 stahnma

+1 for this

abecciu avatar Nov 28 '12 18:11 abecciu

Define inefficient: latency/performance, space, or something else? As far as I can tell it works but duplicates files, because S3 is not actually a filesystem and thus doesn't understand hard links, which are how Freight does almost all of its bookkeeping.
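(A quick way to see this on disk, assuming the default /var/cache/freight cache path; adjust to your VARCACHE setting:

    # Symlinks, e.g. dists/raring -> raring-<timestamp>
    find /var/cache/freight -type l -ls
    # Files stored as multiple hard links, which S3 cannot represent
    find /var/cache/freight -type f -links +1 -ls

Every hard-linked file shows up once per link when a sync tool copies the tree file by file.)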

rcrowley avatar Jun 22 '13 19:06 rcrowley

The problem I have bumped into is syncing the cache dir up into an S3 bucket: s3cmd sync uploads only the actual data dirs underneath dists, say raring-2013..., so Ubuntu fails to find the dists/raring/Packages file when running apt-get update. It would be great if freight had a config option that would just collapse all the files under dists into dists/raring, dists/precise, etc.

Edit: OK, now I see what the other guys probably meant: using s3cmd sync -P --follow-symlinks actually makes this work, but it leads to all the files being uploaded multiple times, since every symlink gets resolved into a full copy.

colszowka avatar Jul 16 '13 15:07 colszowka

Amazing how writing about an issue makes you (finally) think about it: using an exclude pattern of "dists/*-201*" made this work for me as expected. Only the actual distro-specific base dirs get synced, and thanks to the aforementioned symlink expansion all the required files should be there (so no duplicate upload of files to S3 should happen).

To do a full, no-duplicates sync to s3, this works for me: s3cmd sync -P -F --exclude "dists/*-201*" --exclude "*.refs*" PATH/TO/freight/cache/ s3://BUCKET_NAME

Update: You should also exclude "*.refs*", as those files get uploaded as duplicates too and are not referenced from any of the package index files. I've modified the above command accordingly.

The -P makes the files publicly readable, -F is shorthand for the --follow-symlinks option, and the excludes skip freight's housekeeping dirs while syncing.
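Put together, a minimal publish script built from the command above might look like this (the cache path, key email, and bucket name are placeholders, and running freight cache first is an assumption about your workflow):

    #!/bin/sh
    set -e
    # Rebuild the repository indexes, then sync to S3, following symlinks
    # while skipping freight's timestamped housekeeping dirs.
    GPG="KEY_EMAIL" freight cache
    s3cmd sync -P -F \
      --exclude "dists/*-201*" \
      --exclude "*.refs*" \
      PATH/TO/freight/cache/ s3://BUCKET_NAME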

By the way, making the bucket itself work with apt did not require any additional configuration steps; I just created it using s3cmd mb --bucket-location=EU s3://BUCKET_NAME (since I wanted it hosted in the Ireland AWS region).
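For reference, clients can then point apt at the bucket with an entry along these lines (the endpoint style, distribution, and component names here are assumptions; adjust for your bucket's region and distro):

    # /etc/apt/sources.list.d/BUCKET_NAME.list
    deb http://BUCKET_NAME.s3.amazonaws.com raring main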

colszowka avatar Jul 16 '13 16:07 colszowka

@colszowka awesome tip! I had not considered that.

joemiller avatar Nov 18 '13 15:11 joemiller

I am currently getting around the problem in this manner:

    DISTSDIR=/var/cache/freight/dists
    # drop the real directory copied on the previous run
    [ -d "$DISTSDIR/saucy" ] && rm -rf "$DISTSDIR/saucy"
    freight add *.deb apt/saucy
    GPG="[email protected]" freight cache
    # resolve the timestamped dir the symlink points at, then replace
    # the symlink with a real copy of it
    NEWEST=$(readlink "$DISTSDIR/saucy")
    rm "$DISTSDIR/saucy" && cp -r "$DISTSDIR/$NEWEST" "$DISTSDIR/saucy"

This way the distribution directory ("saucy" in this case) is an actual directory, not a link (or a versioned link for that matter), which then is transferred nicely to S3 during a sync.
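(With the symlink replaced by a real directory, the sync should no longer need --follow-symlinks; something like the following ought to do, though the exclude pattern and bucket name are assumptions:

    # dists/saucy is now a plain directory and uploads as-is; the
    # timestamped housekeeping dirs can still be skipped.
    s3cmd sync -P --exclude "dists/*-201*" /var/cache/freight/ s3://BUCKET_NAME

)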

SunSparc avatar Dec 04 '13 00:12 SunSparc

Just a quick question: why are there dists/name-timestamp directories at all? I couldn't find a reason for them or any mention in the man pages.

They allow Freight to present an always-intact repository to the world.

rcrowley avatar Oct 05 '14 00:10 rcrowley

You mean like a snapshot?

Not really. If we didn't do something like this there'd be a (very brief) window during freight cache in which the entire archive would respond 404 for clients. By changing a symbolic link this becomes an atomic operation and clients will never experience failures during archive changes. (This isn't strictly true in case of packages being removed from the archive but, of course, you want those packages to go away so the 404 responses are warranted.)
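(In shell terms the pattern is roughly the following; this is a sketch of the idea, not Freight's actual code, and the names are made up:

    # Build the new tree under a timestamped name, then swap the symlink
    # atomically so clients always see either the old or the new archive.
    NEW="raring-$(date +%s)"
    mkdir -p "dists/$NEW"
    # ...populate dists/$NEW with the new package indexes...
    ln -s "$NEW" "dists/.raring.tmp"
    mv -T "dists/.raring.tmp" "dists/raring"   # rename(2) is atomic; -T is GNU mv

)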

rcrowley avatar Oct 11 '14 17:10 rcrowley

And any reason not to remove the old snapshots?