Traceback on a frequent backup
I got the following traceback on a frequent backup task (running every 5 minutes). The failure happened only once and subsequent runs succeeded normally, so it is not a big issue for me. I am just posting it in case there is a corner case that could be handled better.
Traceback (most recent call last):
File "/usr/local/bin/attic", line 5, in <module>
pkg_resources.run_script('Attic==0.13', 'attic')
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
main()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
exit_code = archiver.run(sys.argv[1:])
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
return args.func(args)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 129, in do_create
archive.save()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archive.py", line 198, in save
self.cache.commit()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 118, in commit
os.path.join(self.path, 'txn.tmp'))
OSError: [Errno 2] No such file or directory
Let me expand a little bit on this...
I am using the same (remote) repository for two different backup tasks, separated with a different archive prefix. One backup task is scheduled to run very frequently (every 5 minutes), while the other is scheduled to run only once per day. Both backup tasks include a 'prune' command after the 'create' command to remove older archives.
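To make the setup more concrete, each task does roughly the following (only a sketch: the repository URL, the paths and the prune options are placeholders, not my real configuration):

import subprocess
import sys
from datetime import datetime

REPO = 'user@backuphost:backup.attic'   # placeholder remote repository

def backup(prefix, paths, prune_opts):
    """One scheduled run: 'create' a new archive, then 'prune' old ones."""
    archive = '%s_%s' % (prefix, datetime.now().strftime('%Y-%m-%d-%H:%M'))
    # create a new archive in the shared repository
    subprocess.check_call(['attic', 'create', '%s::%s' % (REPO, archive)] + list(paths))
    # remove older archives; the retention/prefix options are left empty here
    # on purpose, see 'attic prune --help' for the real flags
    subprocess.check_call(['attic', 'prune', REPO] + list(prune_opts))

if __name__ == '__main__':
    if sys.argv[1] == 'frequent':        # scheduled every 5 minutes
        backup('frequent-archive', ['/etc', '/srv/data'], [])
    else:                                # scheduled once per day
        backup('daily-archive', ['/home'], [])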
The traceback in the first post happened only once, on a 'create' command of the frequent backup. The daily backup, which takes more time, always succeeds on the 'create' command but occasionally fails on the 'prune' command with the following traceback:
Traceback (most recent call last):
File "/usr/local/bin/attic", line 5, in <module>
pkg_resources.run_script('Attic==0.13', 'attic')
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
main()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
exit_code = archiver.run(sys.argv[1:])
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
return args.func(args)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 350, in do_prune
cache.commit()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 118, in commit
os.path.join(self.path, 'txn.tmp'))
OSError: [Errno 39] Directory not empty
So I am thinking there may be some kind of race condition related to how the cache transaction directories are handled when multiple processes access the same repository. @jborg could that be the case?
Here is the output of 'attic list' for the two backup tasks just before the 'prune' command failed:
daily-archive_2014-12-27-02:15       Sat Dec 27 02:18:17 2014
frequent-archive_2014-12-27-02:16    Sat Dec 27 02:18:18 2014
As you can see, the 'create' commands of the two backup tasks finished very close in time. So, 'prune' commands started just after 'Sat Dec 27 02:18:17 2014' and 'Sat Dec 27 02:18:18 2014' respectively.
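To make the suspicion more concrete: [Errno 2] in the first traceback is exactly what a plain os.rename() raises when another process has already renamed the source directory away. This is only an illustration of the interleaving I have in mind, not the actual Attic code:

import errno
import os
import tempfile

# illustration only: two commits racing on the same cache directory
cache = tempfile.mkdtemp()
active = os.path.join(cache, 'txn.active')
tmp = os.path.join(cache, 'txn.tmp')
os.mkdir(active)

# "process A" commits first and renames txn.active out of the way
os.rename(active, tmp)

# "process B" attempts the same rename a moment later: the source is gone
try:
    os.rename(active, tmp)
except OSError as e:
    print(e.errno == errno.ENOENT)   # True -> [Errno 2] No such file or directory

os.rmdir(tmp)
os.rmdir(cache)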
I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?
No, there isn't any kind of unreliable filesystem. EXT4 is used on both sides.
As I posted on issue #167, my very frequent backup task was taking quite a lot of time because it executed a 'list' command before the 'create' command and a 'prune' command after it. I have since improved this by dropping the 'list' operation, and I have not seen errors of the kind described in this issue since that change. That makes me think these errors were more likely to happen when "overlapping" executions of the 'list-create-prune' backup tasks were in action. Reducing the chain from 'list-create-prune' to 'create-prune' (with 'create' being pretty fast) shortens each run and makes overlapping executions less likely. That said, I know the operations acquire an exclusive lock on the repository, so overlapping should normally not be a problem.
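In case it is useful to others with overlapping schedules: wrapping the whole create/prune run in an external lock serializes the invocations independently of the repository lock. A minimal sketch (the lock file path and the attic arguments are placeholders):

import fcntl
import subprocess

# hold an exclusive lock for the whole run, so an overlapping invocation
# started by the scheduler simply waits until the previous one has finished
with open('/var/lock/attic-frequent.lock', 'w') as lock:
    fcntl.flock(lock, fcntl.LOCK_EX)
    subprocess.check_call(['attic', 'create', 'REPO::frequent-archive_test', '/etc'])
    subprocess.check_call(['attic', 'prune', 'REPO'])   # plus retention options
# the lock is released when the file is closed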
To clarify (as it is not visible from the 2 tracebacks):
The operation that fails in both cases is:
os.rename(os.path.join(self.path, 'txn.active'),
os.path.join(self.path, 'txn.tmp'))
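For reference, [Errno 39] is the normal rename(2) behaviour on Linux when the destination is a directory that already exists and is not empty, e.g. a stale 'txn.tmp' or 'txn.active' left behind by an interrupted run. A minimal demonstration (not Attic code, just the underlying system call behaviour):

import errno
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
src = os.path.join(base, 'txn.active')
dst = os.path.join(base, 'txn.tmp')

os.mkdir(src)
os.mkdir(dst)
open(os.path.join(dst, 'leftover'), 'w').close()   # destination is not empty

try:
    os.rename(src, dst)                  # rename a directory onto a non-empty one
except OSError as e:
    print(e.errno == errno.ENOTEMPTY)    # True on Linux -> [Errno 39] Directory not empty

shutil.rmtree(base)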
I am still getting these tracebacks in my 'create & prune' frequent backup task on the repository mentioned above. The traceback on the 'prune' operation is somewhat different:
Traceback (most recent call last):
File "/usr/local/bin/attic", line 5, in <module>
pkg_resources.run_script('Attic==0.13', 'attic')
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
main()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
exit_code = archiver.run(sys.argv[1:])
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
return args.func(args)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 346, in do_prune
archive.delete(stats)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archive.py", line 333, in delete
self.cache.chunk_decref(items_id, stats)
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 198, in chunk_decref
self.begin_txn()
File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 96, in begin_txn
os.path.join(self.path, 'txn.active'))
OSError: [Errno 39] Directory not empty
On a side note, I wish the prune operation was done on the server side for remote repositories. With a big enough repository, the way it is implemented right now generates so much traffic and is so slow that a frequent backup task is practically not feasible: each invocation has to wait for the previous one to complete. I also suspect that the tracebacks above are somehow triggered by these create/prune processes piling up and "fighting" for access to the repo (although the locking mechanism should protect against this).
@Ernest0x: about the too-much-traffic / server-side prune topic, please create a separate ticket for that.
I suspect this issue is caused by an internal inconsistency about whether a transaction is active or not. The code in question changed after 0.13, so it might be that this error does not happen any more.
I also did a fix related to the transaction state of a remote repo server process that could be relevant, see here:
https://github.com/borgbackup/borg/blob/0b9dcbf8e529d0548d99d75c980c1913594d68ac/borg/repository.py#L174