dulwich
repack implementation
Dulwich should have a way of repacking a repository.
+1
I've tried the initial version of the repack function. In https://github.com/jelmer/dulwich/issues/549 you wrote:
- it currently caches all objects in memory while it repacks
- it's not triggered automatically, you have to call it manually
My initial trial also shows what appears to be a performance problem in what I expected to be a trivial scenario: an initial clone followed immediately by a repack. Most of the time is spent iterating through the objects in the old packs (repack 3 to 4, about 20 s and 13 s in the two runs below) and then writing the new pack in add_objects() (repack 4 to 5, about 53 s and 16 s):
2017-08-24 16:13:00.051Z:INFO:15732:__main__:321:git - Counting objects: 116428
2017-08-24 16:13:00.078Z:INFO:15732:__main__:321:git - Counting objects: 124009, done.
...
2017-08-24 16:13:00.259Z:INFO:15732:__main__:321:git - Compressing objects: 100% (24115/24115)
2017-08-24 16:13:00.263Z:INFO:15732:__main__:321:git - Compressing objects: 100% (24115/24115), done.
2017-08-24 16:13:03.021Z:INFO:15732:__main__:321:git - Total 124009 (delta 99887), reused 123908 (delta 99808)
2017-08-24 16:13:15.190Z:INFO:15732:dulwich.object_store:354:@@@ repack 1
2017-08-24 16:13:15.190Z:INFO:15732:dulwich.object_store:358:@@@ repack 2
2017-08-24 16:13:15.190Z:INFO:15732:dulwich.object_store:361:@@@ repack 3
2017-08-24 16:13:35.501Z:INFO:15732:dulwich.object_store:365:@@@ repack 4
2017-08-24 16:14:28.001Z:INFO:15732:dulwich.object_store:368:@@@ repack 5
2017-08-24 16:14:28.001Z:INFO:15732:dulwich.object_store:371:@@@ repack 6
2017-08-24 16:14:28.042Z:INFO:15732:dulwich.object_store:374:@@@ repack 7
2017-08-24 16:14:35.979Z:INFO:15732:__main__:321:git - Counting objects: 1
2017-08-24 16:14:36.514Z:INFO:15732:__main__:321:git - Counting objects: 103356, done.
...
2017-08-24 16:14:36.621Z:INFO:15732:__main__:321:git - Compressing objects: 100% (57708/57708), done.
2017-08-24 16:14:37.044Z:INFO:15732:__main__:321:git - Total 103356 (delta 45615), reused 103235 (delta 45552)
2017-08-24 16:14:57.030Z:INFO:15732:dulwich.object_store:354:@@@ repack 1
2017-08-24 16:14:57.030Z:INFO:15732:dulwich.object_store:358:@@@ repack 2
2017-08-24 16:14:57.030Z:INFO:15732:dulwich.object_store:361:@@@ repack 3
2017-08-24 16:15:09.760Z:INFO:15732:dulwich.object_store:365:@@@ repack 4
2017-08-24 16:15:25.288Z:INFO:15732:dulwich.object_store:368:@@@ repack 5
2017-08-24 16:15:25.289Z:INFO:15732:dulwich.object_store:371:@@@ repack 6
2017-08-24 16:15:25.300Z:INFO:15732:dulwich.object_store:374:@@@ repack 7
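To make the per-stage cost easier to read off, here is a minimal sketch that turns the `@@@ repack N` lines into durations. It assumes the log format shown above (a 23-character timestamp followed by `Z`); the function name `stage_durations` is hypothetical, and the sample lines are copied from the first run.

```python
from datetime import datetime

# '@@@ repack N' lines copied from the first repack() run in the log above.
LOG_LINES = [
    "2017-08-24 16:13:15.190Z:INFO:15732:dulwich.object_store:354:@@@ repack 1",
    "2017-08-24 16:13:15.190Z:INFO:15732:dulwich.object_store:358:@@@ repack 2",
    "2017-08-24 16:13:15.190Z:INFO:15732:dulwich.object_store:361:@@@ repack 3",
    "2017-08-24 16:13:35.501Z:INFO:15732:dulwich.object_store:365:@@@ repack 4",
    "2017-08-24 16:14:28.001Z:INFO:15732:dulwich.object_store:368:@@@ repack 5",
    "2017-08-24 16:14:28.001Z:INFO:15732:dulwich.object_store:371:@@@ repack 6",
    "2017-08-24 16:14:28.042Z:INFO:15732:dulwich.object_store:374:@@@ repack 7",
]


def stage_durations(lines):
    """Return (from_stage, to_stage, seconds) for consecutive marker lines."""
    stamps = []
    for line in lines:
        if "@@@ repack" not in line:
            continue
        # The timestamp is the first 23 characters, just before the 'Z'.
        when = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
        stage = int(line.rsplit(" ", 1)[1])
        stamps.append((stage, when))
    return [
        (a, b, (tb - ta).total_seconds())
        for (a, ta), (b, tb) in zip(stamps, stamps[1:])
    ]


if __name__ == "__main__":
    for start, end, seconds in stage_durations(LOG_LINES):
        print(f"repack {start} -> {end}: {seconds:.3f}s")
```

For the first run this attributes about 20.3 s to the step between markers 3 and 4 and 52.5 s to the step between markers 4 and 5; the other steps are negligible.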
I added instrumentation as shown:
+        _logger.info('@@@ repack 1')
         loose_objects = set()
         for sha in self._iter_loose_objects():
             loose_objects.add(self._get_loose_object(sha))
+        _logger.info('@@@ repack 2')
         objects = {(obj, None) for obj in loose_objects}
         old_packs = list(self.packs)
+        _logger.info('@@@ repack 3')
         for pack in old_packs:
             objects.update((obj, None) for obj in pack.iterobjects())
+        _logger.info('@@@ repack 4')
         self.add_objects(objects)
+        _logger.info('@@@ repack 5')
         for obj in loose_objects:
             self._remove_loose_object(obj.id)
+        _logger.info('@@@ repack 6')
         for pack in old_packs:
             self._remove_pack(pack)
+        _logger.info('@@@ repack 7')
         return len(objects)
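As an alternative to numbered marker lines, the same measurement could be taken with a small timing context manager. This is only a sketch: `timed_stage` is a hypothetical helper, not part of dulwich, and it uses nothing beyond the standard library.

```python
import logging
import time
from contextlib import contextmanager

_logger = logging.getLogger(__name__)


@contextmanager
def timed_stage(name, log=_logger):
    """Log how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed stages are timed too.
        log.info("%s took %.3fs", name, time.perf_counter() - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with timed_stage("example stage"):
        sum(range(1_000_000))
```

Wrapping each step of repack() in `with timed_stage("..."):` would log the duration directly, rather than leaving it to be computed from timestamps afterwards.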
A second issue is that after calling repack(), I seem to have lost HEAD and am unable to find what should be a valid SHA:
$ git log
fatal: bad object HEAD
$ git show 62bc27
fatal: bad object 62bc27
$ ls -lR objects/
objects/:
total 8
drwxrwxr-x 2 earl earl 4096 Aug 24 09:14 info/
drwxrwxr-x 2 earl earl 4096 Aug 24 09:15 pack/
objects/info:
total 0
objects/pack:
total 0
Is this unexpected, or am I mistaken in thinking that repack() is a drop-in replacement for pack_loose_objects()?
@@ -153,9 +156,9 @@ class _Repo(_Gitorious):
# Try using repack to control the number of open file descriptors
# that dulwich requires:
#
- # https://github.com/jelmer/dulwich/issues/281
+ # https://github.com/jelmer/dulwich/issues/549
- self.__repo.object_store.pack_loose_objects()
+ self.__repo.object_store.repack()
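The empty objects/pack listing above suggests a simple check to run before and after repack(). The sketch below counts pack files and loose objects on disk. It assumes only the standard Git layout (pack-*.pack files under objects/pack, loose objects in two-hex-character fan-out directories); the name `count_stored_objects` is hypothetical.

```python
from pathlib import Path

_HEX = set("0123456789abcdef")


def count_stored_objects(objects_dir):
    """Return (pack_files, loose_objects) found under a .git/objects directory."""
    root = Path(objects_dir)
    packs = len(list((root / "pack").glob("pack-*.pack")))
    loose = 0
    for fanout in root.iterdir():
        # Loose objects live in directories named by the first two hex
        # digits of the SHA; 'info' and 'pack' are skipped by this test.
        if fanout.is_dir() and len(fanout.name) == 2 and set(fanout.name) <= _HEX:
            loose += sum(1 for entry in fanout.iterdir() if entry.is_file())
    return packs, loose


if __name__ == "__main__":
    import sys

    packs, loose = count_stored_objects(sys.argv[1])
    print(f"{packs} pack file(s), {loose} loose object(s)")
```

A result of `(0, 0)` straight after a repack, as in the listing above, means the old packs were removed without a new one being left in place.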
(follow up for that last comment in #552)