dulwich
excessive memory usage cloning large remote repository
When cloning a large repository like git://github.com/lampepfl/dotty.git, Dulwich uses an excessive amount of memory.
Easily reproducible using:
./bin/dulwich clone git://github.com/lampepfl/dotty.git
The memory usage spikes at this line in _complete_thin_pack():
entries = list(indexer)
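To see why this line is costly: materializing an iterator with list() forces every entry to exist in memory at once, whereas consuming it lazily keeps only one entry live at a time. The sketch below is illustrative only (index_entries is a hypothetical stand-in, not Dulwich's actual indexer), but it shows the pattern:

```python
# Illustrative sketch (not Dulwich's actual code): list() holds every
# entry in memory simultaneously; lazy iteration lets each entry be
# garbage-collected before the next one is produced.

def index_entries(n):
    """Hypothetical generator yielding (sha, offset, crc32)-style tuples."""
    for i in range(n):
        yield ("%040x" % i, i * 64, 0)

# Eager: all n tuples are resident at once (what `list(indexer)` does).
eager = list(index_entries(1000))

# Lazy: entries are processed one at a time.
lazy_count = sum(1 for _ in index_entries(1000))

assert len(eager) == lazy_count == 1000
```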
I'm seeing this behavior as well, even with small repositories (~100 commits with around 10MB worth of data). According to a profiler I'm using, memory usage balloons to around half a gig during the object resolution phase. We're using this library in a memory-sensitive environment; are there any plans for this issue? Otherwise, I could take a stab at it, though I'm not an expert on Git internals.
On Fri, Feb 15, 2019 at 09:10:04PM +0000, Matt wrote:

> I'm seeing this behavior as well, even with small repositories (~100 commits with around 10MB worth of data). According to a profiler I'm using, memory usage balloons to around half a gig during the object resolution phase. We're using this library in a memory-sensitive environment; are there any plans for this issue? Otherwise, I could take a stab at it, though I'm not an expert on Git internals.

I don't have concrete plans to work on this in the next month or so. Help debugging this would be great.

Half a gig for a repository of 10MB seems very excessive. What operation are you running, just cloning? What's the size of the output of 'git fast-export --all' on the repository?

-- Jelmer Vernooij [email protected] PGP Key: https://www.jelmer.uk/D729A457.asc
> What operation are you running, just cloning?

Yep, porcelain.clone(repo_url, repo_path).
The final memory usage ends up at 256MB, but internally jumps to ~478MB once _follow_chain is called.
> What's the size of the output of 'git fast-export --all'?

A lot:

$ git fast-export --all | wc -l
20868494
We're syncing state files, so the commits themselves are rather large, but the overall repo is fairly small.
$ du -sh test_repo/
14M test_repo/
I'll have some time to sit down with this next week.
Dulwich has an LRUCache for recently read objects; you may be able to reduce memory consumption by lowering the number of objects kept in that cache. See dulwich/pack.py.
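The trade-off behind that suggestion: a smaller cache bounds how many decompressed objects stay resident at once, at the cost of re-reading evicted ones. The class below is a minimal sketch of the principle using the standard library, not Dulwich's actual LRUCache API:

```python
from collections import OrderedDict

class BoundedLRUCache:
    """Illustrative LRU cache: capping max_cache bounds resident entries,
    trading occasional re-reads for a smaller memory footprint."""

    def __init__(self, max_cache=100):
        self.max_cache = max_cache
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        return default

    def add(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_cache:
            self._data.popitem(last=False)  # evict least recently used

cache = BoundedLRUCache(max_cache=2)
cache.add("a", 1)
cache.add("b", 2)
cache.add("c", 3)  # evicts "a", the least recently used entry
assert cache.get("a") is None
assert cache.get("c") == 3
```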
I believe this is now mostly addressed. Please comment if you can still reproduce it with master.