
Traceback on a frequent backup

Open Ernest0x opened this issue 11 years ago • 7 comments

I got the following traceback on a frequent backup task (running every 5 minutes). The failure happened only once and subsequent backup tasks succeeded normally, so it is not a big issue for me. I am just posting it in case there is some corner case that could be handled better.

Traceback (most recent call last):
  File "/usr/local/bin/attic", line 5, in <module>
    pkg_resources.run_script('Attic==0.13', 'attic')
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
    exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
    main()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 129, in do_create
    archive.save()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archive.py", line 198, in save
    self.cache.commit()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 118, in commit
    os.path.join(self.path, 'txn.tmp'))
OSError: [Errno 2] No such file or directory

Ernest0x avatar Dec 22 '14 14:12 Ernest0x

Let me expand a little bit on this...

I am using the same (remote) repository for two different backup tasks, separated with a different archive prefix. One backup task is scheduled to run very frequently (every 5 minutes), while the other is scheduled to run only once per day. Both backup tasks include a 'prune' command after the 'create' command to remove older archives.

The traceback in the first post happened only once, on a 'create' command for the frequent backup. The other backup, the daily one, takes more time; it always succeeds on the 'create' command, but occasionally fails on the 'prune' command with the following traceback:

Traceback (most recent call last):
  File "/usr/local/bin/attic", line 5, in <module>
    pkg_resources.run_script('Attic==0.13', 'attic')
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
    exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
    main()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 350, in do_prune
    cache.commit()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 118, in commit
    os.path.join(self.path, 'txn.tmp'))
OSError: [Errno 39] Directory not empty

So I am thinking that there may be some kind of race condition in how cache transaction directories are handled when multiple processes try to access the same repository. @jborg could that be the case?
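For illustration, here is a minimal sketch of the scenario this hypothesis implies, assuming (as the tracebacks suggest) that the cache marks an open transaction with a 'txn.active' directory and commits by renaming it to 'txn.tmp'. Only the two path names and the os.rename call are taken from the tracebacks; everything else is hypothetical:

    import os

    def commit(cache_path):
        # The rename seen at cache.py line 118 in the tracebacks.
        os.rename(os.path.join(cache_path, 'txn.active'),
                  os.path.join(cache_path, 'txn.tmp'))

    # Two uncoordinated processes, A and B, sharing the same cache directory:
    #
    #   A: commit()  -> txn.active is renamed to txn.tmp
    #   B: commit()  -> txn.active no longer exists, so os.rename raises
    #                   OSError [Errno 2] No such file or directory
    #                   (the first traceback)
    #
    #   A: is interrupted and leaves a non-empty txn.tmp behind
    #   B: commit()  -> the destination txn.tmp exists and is not empty, so
    #                   os.rename raises OSError [Errno 39] Directory not empty
    #                   (the second traceback)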

Here is the output of 'attic list' for the two backup tasks just before the time the 'prune' command failed:

daily-archive_2014-12-27-02:15      Sat Dec 27 02:18:17 2014
frequent-archive_2014-12-27-02:16   Sat Dec 27 02:18:18 2014

As you can see, the 'create' commands of the two backup tasks finished very close in time. So, 'prune' commands started just after 'Sat Dec 27 02:18:17 2014' and 'Sat Dec 27 02:18:18 2014' respectively.

Ernest0x avatar Dec 27 '14 10:12 Ernest0x

I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?

jborg avatar Jan 03 '15 22:01 jborg

> I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?

No, there isn't any kind of unreliable filesystem. EXT4 is used on both sides.

As I have posted on issue #167, my very frequent backup task was taking quite a lot of time because it executed one 'list' command before the 'create' command and one 'prune' command after the 'create' command. I have since improved this by dropping the 'list' operation, and I have not seen errors of this kind since that change. That makes me think these errors were more likely to happen when "overlapping" executions of the 'list-create-prune' backup tasks were in action. Reducing the time needed for the backup task, by reducing the chain from 'list-create-prune' to 'create-prune' (with 'create' being pretty fast), makes overlapping executions less likely. That said, I know the operations acquire an exclusive lock on the repository, so overlapping normally should not be a problem.
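For reference, this is the kind of guarantee an exclusive lock is expected to provide; a minimal sketch using fcntl.flock, purely illustrative and not necessarily how attic implements its repository lock:

    import fcntl
    import os

    def run_exclusively(lock_path, work):
        # Hold an exclusive lock on lock_path while running work().
        # Illustrative only; attic's own lock implementation may differ.
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until no other process holds it
            work()                           # e.g. the whole create or prune run
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)

With such a lock held around the whole run, an overlapping invocation would simply block instead of observing a half-committed transaction directory, which is why overlapping alone should not explain these errors.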

Ernest0x avatar Jan 04 '15 12:01 Ernest0x

To clarify (as it is not visible from the 2 tracebacks):

The operation that fails in both cases is:

    os.rename(os.path.join(self.path, 'txn.active'),
              os.path.join(self.path, 'txn.tmp'))
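Both errnos seen in the tracebacks correspond to the two ways such a directory rename can fail; a self-contained demonstration, independent of attic:

    import errno
    import os
    import tempfile

    base = tempfile.mkdtemp()

    # Errno 2 (ENOENT): the rename source does not exist.
    try:
        os.rename(os.path.join(base, 'txn.active'),
                  os.path.join(base, 'txn.tmp'))
    except OSError as e:
        print(e.errno == errno.ENOENT, e)

    # Errno 39 (ENOTEMPTY, on Linux): the destination is a non-empty directory.
    os.mkdir(os.path.join(base, 'txn.active'))
    os.mkdir(os.path.join(base, 'txn.tmp'))
    open(os.path.join(base, 'txn.tmp', 'leftover'), 'w').close()
    try:
        os.rename(os.path.join(base, 'txn.active'),
                  os.path.join(base, 'txn.tmp'))
    except OSError as e:
        print(e.errno == errno.ENOTEMPTY, e)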

ThomasWaldmann avatar Mar 08 '15 01:03 ThomasWaldmann

I am still getting these tracebacks in my 'create & prune' frequent backup task on the repository mentioned above. The traceback on the 'prune' operation is somewhat different:

Traceback (most recent call last):
  File "/usr/local/bin/attic", line 5, in <module>
    pkg_resources.run_script('Attic==0.13', 'attic')
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
    exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
    main()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 346, in do_prune
    archive.delete(stats)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archive.py", line 333, in delete
    self.cache.chunk_decref(items_id, stats)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 198, in chunk_decref
    self.begin_txn()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 96, in begin_txn
    os.path.join(self.path, 'txn.active'))
OSError: [Errno 39] Directory not empty

On a side note, I wish the prune operation were done on the server side for a remote repository. With a big enough repository, the way it is implemented right now generates so much traffic and is so slow that a frequent backup task is practically not feasible: each new invocation has to wait for the previous one to complete. I also suspect that the tracebacks above are somehow triggered by these create/prune processes piling up and "fighting" for access to the repo (although the locking mechanism should protect against this).

Ernest0x avatar Mar 17 '15 16:03 Ernest0x

@Ernest0x: regarding the too-much-traffic / server-side prune concern, please create a separate ticket for that.

ThomasWaldmann avatar Mar 17 '15 21:03 ThomasWaldmann

I suspect this issue is caused by an internal inconsistency about whether a transaction is active or not. The code in question changed after 0.13, so it might be that this error does not happen any more.

I also made a fix related to the transaction state of a remote repo server process that could be relevant here; see:

https://github.com/borgbackup/borg/blob/0b9dcbf8e529d0548d99d75c980c1913594d68ac/borg/repository.py#L174
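Purely as an illustration of that kind of inconsistency (this is not the linked patch), a commit that tolerates a transaction marker that has already disappeared could look like:

    import errno
    import os

    def commit_if_active(cache_path):
        # Hypothetical defensive variant: only rename the transaction marker if
        # it is actually present, instead of trusting in-memory transaction state.
        # Not attic's or borg's actual code.
        active = os.path.join(cache_path, 'txn.active')
        tmp = os.path.join(cache_path, 'txn.tmp')
        try:
            os.rename(active, tmp)
        except OSError as e:
            if e.errno != errno.ENOENT:
                raise          # e.g. ENOTEMPTY still indicates a real problem
            # no active transaction on disk, nothing to commit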

ThomasWaldmann avatar Oct 16 '15 14:10 ThomasWaldmann