cpython
gh-121267: Improve performance of tarfile (#121267)
Tarfile in the default write mode spends much of its time resolving UIDs into usernames and GIDs into group names. By caching these mappings, a significant speedup can be achieved.
In my simple benchmark[1], this extra caching speeds up tarfile by 8x.
[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
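A rough sketch of a benchmark in the spirit of the linked gist (the file layout, counts, and sizes here are invented for illustration, not taken from the gist):

```python
import io
import tarfile
import tempfile
import time
from pathlib import Path

def build_tree(root: Path, n_files: int = 200) -> None:
    """Create n_files small files to archive."""
    for i in range(n_files):
        (root / f"file_{i}.txt").write_bytes(b"x" * 64)

def time_tar(root: Path) -> float:
    """Time archiving the tree into an in-memory tarball."""
    buf = io.BytesIO()
    start = time.perf_counter()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.add(root, arcname="tree")
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    build_tree(root)
    elapsed = time_tar(root)
    print(f"archived 200 files in {elapsed:.3f}s")
```

Without the caching, a run like this spends much of its wall time in repeated passwd/group lookups; with it, the lookups collapse to one per unique id.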
- Issue: gh-121267
Can the cache get outdated during execution?
@nineteendo Yes. Suppose the passwd database changes during processing of a tar file. We have two options:
1. Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.
2. Keep the same uname as when the UID was first encountered, ignore the new change.
Python currently does (1), which comes at a steep (several-hundred-percent) performance cost, since we must re-read and parse the passwd/group database for every single file. Note also that we first stat the file and then read passwd, so there is already a race condition in the code if we want to view it that way.
Doing (2) is much faster as I've shown. It's also what GNU tar does, so it's not a new idea.
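Option (2) amounts to memoizing the UID->uname and GID->gname lookups so each id is resolved at most once. A minimal sketch of the idea (function names are invented; a process-wide cache like this is the naive form that the discussion goes on to refine):

```python
import functools

try:
    import grp
    import pwd
except ImportError:  # non-Unix platforms have no passwd/group database
    pwd = grp = None

@functools.lru_cache(maxsize=None)
def uname_for(uid):
    """Resolve a UID to a user name once; later calls hit the cache."""
    if pwd is None:
        return ""
    try:
        return pwd.getpwuid(uid)[0]
    except KeyError:
        return ""

@functools.lru_cache(maxsize=None)
def gname_for(gid):
    """Resolve a GID to a group name once; later calls hit the cache."""
    if grp is None:
        return ""
    try:
        return grp.getgrgid(gid)[0]
    except KeyError:
        return ""
```

The cache means a passwd change mid-run is not picked up, which is exactly the GNU-tar-style trade-off described above.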
Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.
But you are doing the cache at class level, right? That means the inconsistent mapping might not be confined to a "single" tar file. In a long-running process, the user could generate a tar file, then change the mapping, and try to generate another one; but the name is cached and the change won't be reflected.
Yeah, that's what I was thinking too. Would a new parameter be better?
def __init__(self, name=None, mode="r", fileobj=None, format=None,
tarinfo=None, dereference=None, ignore_zeros=None, encoding=None,
errors="surrogateescape", pax_headers=None, debug=None,
- errorlevel=None, copybufsize=None, stream=False):
+ errorlevel=None, copybufsize=None, stream=False, uname_cache=None,
+ gname_cache=None):
But you are doing the cache at class level right?
I'm sorry, that was a mistake. I intended the cache to be per TarFile instance, to make the lifetime of the cache clearly limited. I'd be happy to correct it. I agree that a process-global cache isn't a good idea.
@nineteendo I'm not sure I follow you. Do you mean to make the caching opt-in? I was hoping that this could be turned on by default, so more code can benefit from the speedup.
Nevermind, just put it on the instance (there's already a cache for inodes):
https://github.com/python/cpython/blob/41397cee70f4933b539ab1a9232f04419ac8e2bb/Lib/tarfile.py#L1724-L1732
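The per-instance approach, mirroring the existing inodes dict linked above, could look roughly like this (the class and method names are invented for illustration; this is not the merged patch):

```python
try:
    import grp
    import pwd
except ImportError:  # non-Unix platforms have no passwd/group database
    pwd = grp = None

class NameCachingArchiver:
    """Toy stand-in for TarFile that caches name lookups per instance."""

    def __init__(self):
        # Cache lifetime is tied to this instance, so a long-running
        # process that opens a new archive sees fresh passwd/group data.
        self._uname_cache = {}  # uid -> uname, filled lazily
        self._gname_cache = {}  # gid -> gname, filled lazily

    def uname(self, uid):
        if uid not in self._uname_cache:
            try:
                self._uname_cache[uid] = pwd.getpwuid(uid)[0] if pwd else ""
            except KeyError:
                self._uname_cache[uid] = ""
        return self._uname_cache[uid]

    def gname(self, gid):
        if gid not in self._gname_cache:
            try:
                self._gname_cache[gid] = grp.getgrgid(gid)[0] if grp else ""
            except KeyError:
                self._gname_cache[gid] = ""
        return self._gname_cache[gid]
```

A fresh instance starts with empty caches, so the stale-mapping concern is limited to a single archive's lifetime.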
@nineteendo @gaogaotiantian Thanks for your feedback! I have pushed a fixed version now.
Next time, please do not force-push; it makes it easier for us to review the code and its history.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
And if you don't make the requested changes, you will be put in the comfy chair!
@gaogaotiantian: I have made the requested changes; please review again
Thanks for making the requested changes!
@gaogaotiantian: please review the changes made to this pull request.
@ethanfurman is the component owner of tar related stuff. I'll leave the decision for him. The code looks good to me, but I'm not sure if there is any concern specifically for the optimization.
Thanks @picnixz, I have applied your suggestions.
@ethanfurman, would you care to have a look at the patch? I think this performance gain would be quite beneficial for users of "tarfile".
@picnixz @gaogaotiantian @ethanfurman My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along? Thanks for the help.
My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along
I personally have no commit rights, and patches may stay pending for a long time until a core dev accepts them. I'm not a tarfile expert, so I think you'll need to wait until Ethan has time to review it.
@picnixz Thanks. I wasn't sure if people were waiting for me to do something at this point.
Hello, not a core dev here.
I just noticed too that the performance of tarfile is horrible, and found this ticket.
Commercial profiler run, using tarfile to tar the venv:
96 seconds to do the tarfile.add(), one third of which is spent looking up user/group info.
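Those numbers come from the commenter's own profiler run; one way to check the same thing locally with just the stdlib is to profile tarfile.add() and look for pwd.getpwuid / grp.getgrgid near the top (the tree built here is a small stand-in, not a real venv):

```python
import cProfile
import io
import pstats
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    for i in range(100):
        (root / f"f{i}").write_bytes(b"data")

    buf = io.BytesIO()
    profiler = cProfile.Profile()
    profiler.enable()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.add(root, arcname="tree")  # the call being profiled
    profiler.disable()

    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    # stats.print_stats(10)  # uncomment to see the top offenders
```

On a pre-patch interpreter the name-lookup calls show up once per archived file; on a patched one, once per unique uid/gid.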
Thanks everyone for moving this along and sorry I missed it -- my wife was in the hospital when this first started.
Thanks for your help everyone! Ethan, I hope your wife is feeling better.
@hauntsaninja Should we backport the changes? (it may be nice and I don't think it breaks compatibility here)
Ethan, I hope your wife is feeling better
I hope so too!
While it would be nice, it feels more like an enhancement than a bug fix.
(And she is, thank you.)
We typically don't backport performance improvements. And hope things are well, Ethan!
It would be great to backport: it's a very small patch that makes tarring twice as fast.
I wouldn't be surprised if this patch alone could save a power plant's worth of energy from computers wasting time archiving stuff inefficiently. Do we really have to wait for Python 3.14 to land for the fix to be available?
I've had enough "simple" patches with unintended consequences that I am not willing to backport this one. If you would like to open a discourse thread about it and get consensus that this patch is fine to backport, I'll be happy to do so.