dulwich

excessive memory usage cloning large remote repository

Open jelmer opened this issue 6 years ago • 5 comments

When cloning a large repository like git://github.com/lampepfl/dotty.git, Dulwich uses an excessive amount of memory.

Easily reproducible using:

./bin/dulwich clone git://github.com/lampepfl/dotty.git

jelmer avatar Apr 11 '18 23:04 jelmer

The memory usage goes overboard in this line in _complete_thin_pack():

    entries = list(indexer)
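To illustrate why a line like that can dominate memory, here is a minimal sketch. It is not Dulwich's code: `fake_entries` is a hypothetical stand-in for the indexer, and the entry count and sizes are made up. It only shows that calling `list()` on a lazy iterator keeps every entry alive at once, whereas consuming it one item at a time does not. Whether `_complete_thin_pack()` can actually be restructured to stream is a separate question.

```python
import tracemalloc


def fake_entries(count, size):
    """Hypothetical stand-in for a pack indexer: yields one entry at a time."""
    for i in range(count):
        yield (i, bytes(size))


def peak_bytes(consume):
    """Return the peak number of bytes allocated by Python while consume() runs."""
    tracemalloc.start()
    try:
        consume()
        return tracemalloc.get_traced_memory()[1]
    finally:
        tracemalloc.stop()


def materialize():
    # Same shape as `entries = list(indexer)`: every entry is held simultaneously.
    entries = list(fake_entries(200, 64 * 1024))
    return len(entries)


def stream():
    # Each entry becomes garbage as soon as the next one is produced.
    return sum(1 for _ in fake_entries(200, 64 * 1024))


list_peak = peak_bytes(materialize)
stream_peak = peak_bytes(stream)
print(f"list(): {list_peak} bytes peak, streaming: {stream_peak} bytes peak")
```

Note that `tracemalloc` only sees allocations made through Python's allocators, so the numbers are a lower bound on what the process actually uses.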

jelmer avatar Apr 11 '18 23:04 jelmer

I'm seeing this behavior as well, even with small repositories (~100 commits with around 10MB worth of data). According to a profiler I'm using, memory usage balloons to around half a gig during the object resolution phase. We're using this library in a memory-sensitive environment; are there any plans for this issue? Otherwise, I could take a stab at it, though I'm not an expert on Git internals.

staticfox avatar Feb 15 '19 21:02 staticfox

On Fri, Feb 15, 2019 at 09:10:04PM +0000, Matt wrote:

> I'm seeing this behavior as well, even with small repositories (~100 commits with around 10MB worth of data). According to a profiler I'm using, memory usage balloons to around half a gig during the object resolution phase. We're using this library in a memory-sensitive environment; are there any plans for this issue? Otherwise, I could take a stab at it, though I'm not an expert on Git internals.

I don't have concrete plans to work on this in the next month or so. Help debugging this would be great.

Half a gig for a repository of 10MB seems excessive. What operation are you running, just cloning? What's the size of the output of 'git fast-export --all' on the repository?

-- Jelmer Vernooij PGP Key: https://www.jelmer.uk/D729A457.asc

jelmer avatar Feb 16 '19 00:02 jelmer

What operation are you running, just cloning?

Yep, porcelain.clone(repo_url, repo_path).

The final memory usage ends up being 256MB, but it jumps to ~478MB once _follow_chain is called.
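For anyone who wants to reproduce a measurement like this without an external profiler, here is a rough sketch using the standard-library `tracemalloc` module. The helper names `peak_memory_mib` and `clone_and_measure` are made up for this example. `tracemalloc` only tracks allocations made through Python's allocators, so the figure will not match the process RSS that a profiler reports.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak Python-level allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def clone_and_measure(repo_url, repo_path):
    """Clone with dulwich and return the peak traced memory in MiB."""
    # Imported lazily so peak_memory_mib stays usable without dulwich installed.
    from dulwich import porcelain

    _, peak = peak_memory_mib(porcelain.clone, repo_url, repo_path)
    return peak
```

Calling `clone_and_measure(repo_url, repo_path)` with the same arguments as the `porcelain.clone` call above should give a number that can be compared before and after a change, even if its absolute value differs from the profiler's.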

What's the size of the output

A lot

$ git fast-export --all | wc -l
20868494

We're syncing state files, so the individual commits are rather large, but the repo as a whole is fairly small on disk.

$ du -sh test_repo/
14M     test_repo/

I'll have some time to sit down with this next week.

staticfox avatar Feb 16 '19 01:02 staticfox

Dulwich has an LRUCache for recently read objects; you may be able to reduce memory consumption by lowering the number of objects kept in that cache. See dulwich/pack.py.
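As a generic illustration of that trade-off, here is a minimal LRU cache sketch. It is not Dulwich's LRUCache, and the class name `BoundedLRU` and its `max_items` parameter are hypothetical. The point is only that the cap on the number of entries bounds how many objects stay referenced at any time.

```python
from collections import OrderedDict


class BoundedLRU:
    """Hypothetical minimal LRU cache; not Dulwich's LRUCache implementation."""

    def __init__(self, max_items):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._max_items = max_items
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict least recently used entries until the cap is respected.
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)
```

A smaller `max_items` means fewer cached objects held in memory, at the cost of re-reading and re-inflating objects that were evicted and are needed again.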

jelmer avatar Feb 16 '19 20:02 jelmer

I believe this is now mostly addressed. Please comment if you can still reproduce it with master.

jelmer avatar Jan 15 '23 21:01 jelmer