acd_cli
support for gocryptfs, ecryptfs, borg, duplicity, and rsync
Edited summary follows since this PR thread has gotten long.
First, please direct any issues you have with this PR here so this thread doesn't get any more out of control: https://github.com/bgemmill/acd_cli
This PR provides two primary features to allow rsyncing into a layered encrypted filesystem:
- out of order rewriting
- mtime support
And a few caches to make the above features performant:
- node_id to node (in memory)
- path to node_id (in memory)
- the content of small files and symlinks (in nodes.db)
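As a rough illustration (class and method names here are invented, not the PR's actual code), the two in-memory lookup caches amount to dictionaries guarded by a lock:

```python
import threading

class NodeCache:
    """Illustrative sketch of the two in-memory lookup caches:
    node_id -> node object, and path -> node_id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = {}       # node_id -> node
        self._path_to_id = {}  # path -> node_id

    def get_by_path(self, path, resolve):
        """Return the cached node for *path*, falling back to *resolve*,
        a callable that does the expensive database/API lookup."""
        with self._lock:
            node_id = self._path_to_id.get(path)
            if node_id is not None and node_id in self._nodes:
                return self._nodes[node_id]
        node = resolve(path)  # slow path: hit nodes.db or the API
        with self._lock:
            self._path_to_id[path] = node.id
            self._nodes[node.id] = node
        return node

    def invalidate(self, path, node_id):
        """Drop cache entries when a node changes (rename, delete)."""
        with self._lock:
            self._path_to_id.pop(path, None)
            self._nodes.pop(node_id, None)
```

The point of the path cache is that FUSE hands every operation a path, so without it each getattr/write pays a full path walk through the database.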
With those implemented, it was pretty simple to add a few other things to flesh out our filesystem support:
- uid/gid/mode
- symlinks
- fs block size for du and stat
The rationale for out of order rewriting is that most encrypting file systems maintain a header around the beginning of the file that gets updated as the rest of the file is written. This means that write patterns typically look like sets of [append to end, overwrite at beginning]. I'm solving this issue by using a write-back cache that stores file writes in a SpooledTemporaryFile until all file handles are closed, and only then pushing to amazon.
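A minimal sketch of that write-back scheme, assuming one SpooledTemporaryFile per open file and a hypothetical upload callable (none of these names come from the actual PR):

```python
import tempfile
import threading

class WriteBackBuffer:
    """Sketch of the write-back cache described above: writes land in a
    SpooledTemporaryFile (RAM until the spool limit, then disk) and the
    upload happens only when the last file handle is released."""

    SPOOL_MAX = 1 << 30  # keep up to 1 GiB in memory before spilling to disk

    def __init__(self, upload):
        self.buffer = tempfile.SpooledTemporaryFile(max_size=self.SPOOL_MAX)
        self.upload = upload       # callable that pushes the bytes to the cloud
        self.open_handles = 0
        self.lock = threading.Lock()

    def open(self):
        with self.lock:
            self.open_handles += 1

    def write(self, data, offset):
        with self.lock:
            self.buffer.seek(offset)   # out-of-order rewrites are allowed
            self.buffer.write(data)
            return len(data)

    def release(self):
        """Called on each handle close; upload once the count hits zero."""
        with self.lock:
            self.open_handles -= 1
            if self.open_handles > 0:
                return
            self.buffer.seek(0)
            self.upload(self.buffer)   # the real work happens here
            self.buffer.close()
```

Because release() is what triggers the upload, a close() that returns instantly does not mean the data is on Amazon yet.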
The rationale for mtime is that rsync uses it for file equality testing. I'm implementing this by using one of the 10 properties an amazon app gets to store all file xattrs as a json object. Once mtime and xattrs were in place, it was straightforward to add the others.
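The xattr-in-a-property encoding can be sketched like this (the property key and helper names are illustrative, not the PR's actual identifiers):

```python
import json

# Hypothetical property key; the real PR uses its own name.
_XATTR_PROPERTY_NAME = 'xattrs'

def xattrs_to_property(xattrs):
    """Serialize a dict of xattrs (mtime, uid, gid, mode, ...) into the
    single JSON string stored in one of the app's 10 node properties."""
    return json.dumps(xattrs, separators=(',', ':'), sort_keys=True)

def property_to_xattrs(value):
    """Inverse: parse the stored property back into a dict, tolerating
    a missing or empty property on nodes created before this feature."""
    if not value:
        return {}
    return json.loads(value)
```

Packing everything into one JSON object is what makes uid/gid/mode and symlink targets cheap to add later: they are just more keys in the same property.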
Considerations:
- The SpooledTemporaryFile keeps writes smaller than 1G in memory; opening, writing, and not closing many files below that limit will use a lot of RAM.
- Because pushing to amazon happens on file handle releases and not write calls, expect writes to appear very fast; the actual work happens later.
- Amazon has been reducing the length of properties that it will allow. Setting many or long xattrs will yield errors.
- Because the write back caching is triggered off file handle counts going to zero, mmap will not work as intended.
- Due to how fusepy handles timestamps, there are some files that rsync will think are always changed. https://github.com/terencehonles/fusepy/issues/70
Please enjoy, and let me know if anything goes wrong!
Original post
Ecryptfs has two properties that we need to overcome in order to get it working with acd_cli.
Luckily, this PR addresses both :-)
- ecryptfs writes files 4096 bytes at a time, using a different file handle each time. This PR allows multiple file handles to share a write buffer if they all write sequentially. To make this performant for large files (large numbers of file descriptors), I've added some lookup caching to how nodes are obtained.
- ecryptfs wants to write a cryptographic checksum at the beginning of the file once it's done. We could either buffer everything before sending, which would be memory-intensive for big files, or have ecryptfs store this checksum in the file's xattrs instead. I've opted for the latter, which required implementing xattrs over ACD using one of our allowed properties.
Additionally, ecryptfs is extremely chatty about when it decides to write these attributes. To deal with this, xattrs are marked as dirty and only sent over the wire when a file has all of its handles closed, or when fuse is unloaded.
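A sketch of that dirty-flag batching (illustrative names; `send_property` stands in for the real API call):

```python
class XattrStore:
    """Sketch of dirty-tracked xattrs: mutations only set a flag, and the
    (slow) property write over the wire happens on the last release or
    at unmount."""

    def __init__(self, send_property):
        self.xattrs = {}
        self.dirty = False
        self.send_property = send_property

    def setxattr(self, name, value):
        self.xattrs[name] = value
        self.dirty = True          # cheap: no network traffic here

    def flush(self):
        """Called when a file's handle count reaches zero, or at unmount."""
        if self.dirty:
            self.send_property(self.xattrs)  # one wire call per flush
            self.dirty = False
```

However chatty ecryptfs is, only one property write goes over the wire per flush.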
With these changes, I can get about 80% of my unencrypted speed to ACD at home using an encrypted mount. If everything in this PR looks good, I have a few ideas of where to push that a bit more.
Please let me know whether I grokked the fusepy threading model properly; that's the piece I was least sure about, especially how safe or unsafe some things are with the GIL.
Addresses issue: https://github.com/yadayada/acd_cli/issues/368
not bad, would be nice to have a configurable lru style cache to help with reads/writes (with some kind of read ahead)
@yadayada looks like the buildbot needs a new oauth token to test properly. I see this in the logs:
CRITICAL:acdcli.api.oauth:Invalid authentication token: Invalid JSON or missing key. Token: {"refresh_token": "bar", "expires_in": 3600}
16-08-08 00:06:11.286 [CRITICAL] [acdcli.api.oauth] - Invalid authentication token: Invalid JSON or missing key. Token:
I've implemented proper mtime handling in one of the xattrs so that rsync over acd_cli can work as expected. This addresses: https://github.com/yadayada/acd_cli/issues/58
Why not "backport" the write buffer feature as a general write-back cache for acd_cli? That'd fix problems with ecryptfs, encfs and any other applications where data is appended in small blocks (and overloads the acd_cli API eating 100% cpu).
Turns out that ecryptfs has a subtle bug when it stores its crypto headers in xattrs; it reports file size incorrectly on the next time it's mounted: https://bugs.launchpad.net/ecryptfs/+bug/1612492
That means rsync will behave properly only if your mount has perfect uptime! :-)
Until they fix that, I've allowed the acd fuse mount to overwrite the first few bytes of a file where the crypto header would go. Because we still need to write to amazon sequentially, I'm solving this by storing the header in xattr space, and splicing it back into the byte stream on read. This still seems better than requiring whole files to be kept in memory until fully written.
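The read-side splice described here can be sketched as follows (names and sizes are illustrative; the stored header comes from xattr space, the body from the sequentially written stream):

```python
def spliced_read(header, body_read, offset, size):
    """Sketch of the workaround above: the file's first len(header) bytes
    live in xattr space, so a read that touches that region serves those
    bytes from `header` and the remainder from `body_read(offset, size)`,
    a callable that reads the remote byte stream at an absolute offset."""
    result = b''
    if offset < len(header):
        # serve the leading bytes from the stored header
        result = header[offset:offset + size]
        size -= len(result)
        offset = len(header)
    if size > 0:
        result += body_read(offset, size)
    return result
```

Readers never see the stale bytes that actually sit at the front of the remote file; the xattr copy is always spliced in.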
I've finally gotten rsync, ecryptfs, and acd_fuse playing nice together. There were enough corner cases around rsync flags I can't control (thanks Synology!) and some older versions of the kernel that make ecryptfs call useless truncates before flushing (thanks Synology!) that the best way to make it all go is to build a write buffer in memory until all the interested file handles are closed. This allows multiple writes to the same offset, out of order writes as long as nothing leaps forward with a gap, and eliminates the hack of putting encrypted headers into xattr space.
Further work will be to use temp file backing rather than memory backing if individual files get too large.
> Further work will be to use temp file backing rather than memory backing if individual files get too large.
@bgemmill this is covered in #314 and is not ecryptfs specific. It'd help with performance and other apps which write file handles non-linearly. It'd be awesome if you could port the write buffer feature as separate PR (separate flag/option) which this one can depend on.
hint: https://github.com/redbo/cloudfuse/blob/master/cloudfuse.c#L256-L289
@Thinkscape I'm only going to pursue the file backing if the write memory backing is too memory intensive. At the moment this PR makes both ecryptfs and rsync work properly, uses memory for only the files being written at any given moment, and that seems like a good place to leave it.
The way I'm looking at it is that this PR is the one that the file backing PR should depend on.
File caching is going to require a bit of thought too, because unless we're smart about LRU like @jrwr pointed out, we'd end up doubling the on-disk space in the process of rsyncing to Amazon.
LRU cache is something different to what I meant.
The caching Swift FUSE does is per file handle - a process opens a file handle for writing, writes as much or little as it likes and closes the handle. That's what most rsync-like streamers and updaters will do.
Of course memory backing will be too memory intensive. If you attempt to rsync or random-write a 8GB file, it'll gladly consume 8+ GB of RAM.
@Thinkscape Thanks for clarifying. @jrwr's point as I understood it was what do you do with that temporary file once you're done. Delete it immediately, keep it around for faster reading, LRU, something else?
As to memory backing, I'm in the middle of going through my wedding videos, and haven't seen a huge hiccup. I'd imagine that's virtual memory doing what you suggest with swapping; I'll have more info tomorrow when my rsync job finishes.
Looking at the job in the middle of today:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 4212 376 300 S 0.0 0.0 0:00.07 minit
22 root 20 0 902184 339612 4892 S 0.0 4.2 117:54.59 acd_cli
30 root 20 0 11128 1076 896 S 0.0 0.0 0:00.06 rsync
923 root 20 0 25956 2532 1216 S 0.0 0.0 0:01.09 rsync
924 root 20 0 26224 1756 260 S 0.0 0.0 26:21.43 rsync
2898 root 20 0 18228 1836 1436 S 0.0 0.0 0:00.04 bash
2914 root 20 0 36660 1716 1256 R 0.0 0.0 0:00.00 top
For me, the steady state usage seems to be about ~400M for this docker image on an 8G box, and a few big files passed through since virtual is around 900M now. Caveat: this is an instantaneous measure rather than peak, and I don't know what reserved was when the big file went through.
I can tell experimentally that this hasn't ground to a halt on swap or thrown python MemoryErrors. We'll see how the rest of the day goes.
Once it finishes I'll look more.
If you want to give it a go before then, fire up a docker container with a 6G RAM limit and do:
dd if=/dev/urandom of=file.blob bs=1MB count=8000
rsync file.blob /amazon/
> If you want to give it a go before then, fire up a docker container with a 6G RAM limit and do:
> dd if=/dev/urandom of=file.blob bs=1MB count=8000
> rsync file.blob /amazon/
Yeah, but why? If it buffers it in RAM, of course it'll die with a big file. Furthermore, I do not expect or want my tools to eat up all my server's RAM depending on what it stumbles upon in dir tree. It must not do that, regardless of what I upload to ACD, it's just not the way to go...
@Thinkscape It turns out if you run that example you'd see what I did; no real performance hiccups because the docker memory clamping forces the older bits of big buffers into swap. File backing the old-school way.
To make this change set more palatable to non-docker users of fuse, I put in file backing if writing gets too large. At the moment the default is 1G.
On a different note, it looks like Synology's rsync directory querying fails when directories contain around 10k things; that many calls to getattr take too long for a timeout. I'm going to tackle that next since everyone probably wants 'ls -al' to complete quickly.
> To make this change set more palatable to non-docker users of fuse, I put in file backing if writing gets too large. At the moment the default is 1G.
Thanks. We cannot depend on any specific virtualization or OS feature to automagically manage memory for us. Rsync usually takes just a few megs of RAM regardless of the tree size or individual files' grandeur, and that's what I'd expect from a fuse driver as well. Even 1G seems excessive to me, but at least it's configurable.
@bgemmill any special requirements? It crashes on startup. Init doesn't create it either :|
Getting changes
16-08-18 10:27:39.754 [ERROR] [acd_cli] - Traceback (most recent call last):
File "acd_cli.py", line 223, in autosync
sync_node_list(full=False)
File "acd_cli.py", line 161, in sync_node_list
cache.remove_purged(changeset.purged_nodes)
File "/acd_cli/acdcli/cache/sync.py", line 45, in remove_purged
c.execute('DELETE FROM properties WHERE id IN %s' % placeholders(slice_), slice_)
sqlite3.OperationalError: no such table: properties
Oh, actually it doesn't crash but keeps complaining about that table. Also on xattrs:
16-08-18 10:41:47.432 [DEBUG] [acdcli.acd_fuse] - <- getxattr '[Unhandled Exception]'
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/fuse.py", line 495, in _wrapper
return func(*args, **kwargs) or 0
File "/usr/local/lib/python3.5/dist-packages/fuse.py", line 647, in getxattr
name.decode(self.encoding), *args)
File "/root/acd_cli/acdcli/acd_fuse.py", line 279, in __call__
ret = getattr(self, op)(path, *args)
File "/root/acd_cli/acdcli/acd_fuse.py", line 406, in getxattr
return self._getxattr_bytes(node_id, name)
File "/root/acd_cli/acdcli/acd_fuse.py", line 421, in _getxattr_bytes
return binascii.a2b_base64(self._getxattr(node_id, name))
File "/root/acd_cli/acdcli/acd_fuse.py", line 409, in _getxattr
self._xattr_load(node_id)
File "/root/acd_cli/acdcli/acd_fuse.py", line 457, in _xattr_load
xattrs_str = self.cache.get_property(node_id, self.acd_client_owner, _XATTR_PROPERTY_NAME)
File "/root/acd_cli/acdcli/cache/query.py", line 347, in get_property
c.execute(PROPERTY_BY_ID_SQL, [node_id, owner_id, key])
sqlite3.OperationalError: no such table: properties
The good news is, it seems to work most of the time. Testing under various loads.
@Thinkscape short story: you'll want to delete your nodes.db and re-sync.
The way acd_cli sync works now is to fetch changes since the last snapshot, and that only gets properties for the newest nodes; that's why it works most but not all of the time. Everything works as intended if you delete your nodes.db and resync; after that, properties will be fetched in the same snapshot way going forward.
I'd be happy to look at patches if there's an elegant way to do that in a _3_to_4 type db upgrade function.
> I'd be happy to look at patches if there's an elegant way to do that in a _3_to_4 type db upgrade function.
Meh. If they get CREATEd on first sync, it's easier and safer to just unlink nodes.db when v3 is detected.
Interesting ... with this branch, my mount randomly unmounts after some time (probably crashes). When uploading I've recently encountered this:
16-08-21 12:48:32.530 [ERROR] [acd_cli] - Traceback (most recent call last):
File "/root/acd_cli/acd_cli.py", line 248, in wrapped
ret_val = f(*args, **kwargs)
File "/root/acd_cli/acd_cli.py", line 516, in upload_file
rmod = datetime_to_timestamp(conflicting_node.modified)
File "/root/acd_cli/acdcli/utils/time.py", line 5, in datetime_to_timestamp
return (dt - datetime(1970, 1, 1)) / timedelta(seconds=1)
TypeError: can't subtract offset-naive and offset-aware datetimes
@Thinkscape thanks for the find, those should have all gone away with the xattr mtime work.
I've noticed something new. After 8h of operation (on 06efeca565720d0e74fadbdbc6f1ff2b2ccaeea4) , the fuse mount became sluggish. All writes would result in >50% wait load with the acd_cli.py only at the usual 1-3% user. Unmounting and remounting fixed it... I'm pulling newest version anyway, maybe it'll go away.
Warning! Rename truncates files :-(
Try this:
dd if=/dev/urandom bs=1024 count=1024 of=/tmp/file.dat && \
cp /tmp/file.dat /mnt/acd/ && \
mv /mnt/acd/file.dat /mnt/acd/file-renamed.dat && \
stat /mnt/acd/file-renamed.dat
The file-renamed.dat becomes 0 bytes... it's gone.
@Thinkscape I can't reproduce that here; are the files really 0 if you look at amazon's website? Also, does your log have any connection issues?
@yadayada I was thinking of adding support for chmod/chown (uid, gid, mode) information to be stored like mtime so rsync can preserve those too. New PR or should I get those in here with mtime?
Damn. It appears the truncation happens only with encfs on top of acd_cli mount. Doesn't happen in either one alone.
I also feel like this branch is much more CPU hungry. Could you please verify that for me @bgemmill ? Here's a session of copying a single 444MB file into ACD mount. Notice how high the CPU usage gets when it spools and then transfers the file to the cloud ...

I've verified that the highest load occurs when writing - filling up the cache (either memory or tmp file). Copying a 7GB file shows a >70% CPU acd_cli activity, minimal IO.
@bgemmill Ok, I think I found a more serious problem. WTF is a "writing gap" ? 😲 Every time I write a bigger file (through encfs) I'm getting random input/output errors. I'm not using any tool, just straight cp src dst.
Here's what debug log tells me:
[...]
... many more writes ...
[...]
16-08-23 12:18:41.301 [DEBUG] [acdcli.acd_fuse] - <- getxattr '[Errno 61] No data available'
16-08-23 12:18:41.301 [DEBUG] [acdcli.acd_fuse] - -> write /k/hGCz3wLgC1b03ey3AJhWrITvmxG48XjCx,JOdz,HidDYeDU8NIkm0rm1UlP9KymVL-dp1DCVWAwL6Jekpc9,Nf30MTLP-iLkEIK3jIh2Srtuh-/QwnbJEJWjPxvTGIoSLtmdaFqP52ptszADag6kGrPdWDlCK1boELWxZmZlpAQigWEu9utSPJ2QMQJRouWythcdiFftdhrS9nedhVnXQX2l9IrF- (1024, 659351560, 7)
16-08-23 12:18:41.301 [DEBUG] [acdcli.acd_fuse] - <- write 1024
16-08-23 12:18:41.301 [DEBUG] [acdcli.acd_fuse] - -> getxattr /k/hGCz3wLgC1b03ey3AJhWrITvmxG48XjCx,JOdz,HidDYeDU8NIkm0rm1UlP9KymVL-dp1DCVWAwL6Jekpc9,Nf30MTLP-iLkEIK3jIh2Srtuh-/QwnbJEJWjPxvTGIoSLtmdaFqP52ptszADag6kGrPdWDlCK1boELWxZmZlpAQigWEu9utSPJ2QMQJRouWythcdiFftdhrS9nedhVnXQX2l9IrF- ('security.capability',)
16-08-23 12:18:41.301 [DEBUG] [acdcli.acd_fuse] - <- getxattr '[Errno 61] No data available'
16-08-23 12:18:41.301 [DEBUG] [acdcli.acd_fuse] - -> write /k/hGCz3wLgC1b03ey3AJhWrITvmxG48XjCx,JOdz,HidDYeDU8NIkm0rm1UlP9KymVL-dp1DCVWAwL6Jekpc9,Nf30MTLP-iLkEIK3jIh2Srtuh-/QwnbJEJWjPxvTGIoSLtmdaFqP52ptszADag6kGrPdWDlCK1boELWxZmZlpAQigWEu9utSPJ2QMQJRouWythcdiFftdhrS9nedhVnXQX2l9IrF- (1024, 659363848, 7)
16-08-23 12:18:41.302 [ERROR] [acdcli.acd_fuse] - Wrong offset for writing to buffer; writing gap detected
16-08-23 12:18:41.302 [DEBUG] [acdcli.acd_fuse] - <- write '[Errno 29] Illegal seek'
16-08-23 12:18:41.302 [DEBUG] [acdcli.acd_fuse] - -> flush /k/hGCz3wLgC1b03ey3AJhWrITvmxG48XjCx,JOdz,HidDYeDU8NIkm0rm1UlP9KymVL-dp1DCVWAwL6Jekpc9,Nf30MTLP-iLkEIK3jIh2Srtuh-/QwnbJEJWjPxvTGIoSLtmdaFqP52ptszADag6kGrPdWDlCK1boELWxZmZlpAQigWEu9utSPJ2QMQJRouWythcdiFftdhrS9nedhVnXQX2l9IrF- (7,)
16-08-23 12:18:41.303 [DEBUG] [acdcli.acd_fuse] - <- flush None
16-08-23 12:18:41.305 [DEBUG] [acdcli.acd_fuse] - -> flush /k/hGCz3wLgC1b03ey3AJhWrITvmxG48XjCx,JOdz,HidDYeDU8NIkm0rm1UlP9KymVL-dp1DCVWAwL6Jekpc9,Nf30MTLP-iLkEIK3jIh2Srtuh-/QwnbJEJWjPxvTGIoSLtmdaFqP52ptszADag6kGrPdWDlCK1boELWxZmZlpAQigWEu9utSPJ2QMQJRouWythcdiFftdhrS9nedhVnXQX2l9IrF- (7,)
16-08-23 12:18:41.305 [DEBUG] [acdcli.acd_fuse] - <- flush None
16-08-23 12:18:41.306 [DEBUG] [acdcli.acd_fuse] - -> release /k/hGCz3wLgC1b03ey3AJhWrITvmxG48XjCx,JOdz,HidDYeDU8NIkm0rm1UlP9KymVL-dp1DCVWAwL6Jekpc9,Nf30MTLP-iLkEIK3jIh2Srtuh-/QwnbJEJWjPxvTGIoSLtmdaFqP52ptszADag6kGrPdWDlCK1boELWxZmZlpAQigWEu9utSPJ2QMQJRouWythcdiFftdhrS9nedhVnXQX2l9IrF- (7,)
@Thinkscape writing gaps would have shown up as an illegal seek before this PR too; we still require files to be written sequentially. To lift that restriction we'd have to read the whole file back before allowing gapped writes; imagine the worst case of opening an existing file, appending one byte at the end, and closing it.
While I could dive into fuse to make acd do that, you're probably better off performance-wise using a different encrypted mount.
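The check behind that "Illegal seek" is essentially this (a sketch, not the PR's literal code):

```python
def check_write_offset(buffer_len, offset):
    """Sketch of the sequential-write rule: overwrites inside the already
    buffered region are fine, appending exactly at the current end is
    fine, but a write that leaps past the end would leave an undefined
    gap, so it's rejected with ESPIPE (errno 29, 'Illegal seek')."""
    if offset > buffer_len:
        raise OSError(29, 'Illegal seek')  # surfaces as "writing gap detected"
```

This is why encfs's non-linear write pattern trips it while plain cp through ecryptfs does not.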
I thought we were doing just that after adding the write cache. It waits for writes to finish, then POSTs after it flushes. I fail to see the difference between writing x bytes sequentially then flushing, and writing x+y bytes to the same fh and flushing.
> While I could dive into fuse to make acd do that, you're probably better off performance-wise using a different encrypted mount.
Are you implying ecryptfs would behave differently? From what I've read in the docs, the encryption routine is quite similar - each file is wrapped with metadata and encoded, then stored as a (portable) file in the underlying directory.