cpython
gh-121267: Improve performance of tarfile (#121267)
Tarfile in the default write mode spends much of its time resolving UIDs into usernames and GIDs into group names. By caching these mappings, a significant speedup can be achieved.
In my simple benchmark[1], this extra caching speeds up tarfile by 8x.
[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
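A rough sketch of a benchmark in the spirit of the linked gist (the file layout, counts, and sizes here are invented for illustration, not taken from the gist):

```python
import io
import tarfile
import tempfile
import time
from pathlib import Path

def build_tree(root: Path, n_files: int = 200) -> None:
    """Create n_files small files to archive."""
    for i in range(n_files):
        (root / f"file_{i}.txt").write_bytes(b"x" * 64)

def time_tar(root: Path) -> float:
    """Time archiving the tree into an in-memory tarball."""
    buf = io.BytesIO()
    start = time.perf_counter()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.add(root, arcname="tree")
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    build_tree(root)
    elapsed = time_tar(root)
    print(f"archived 200 files in {elapsed:.3f}s")
```

Without the caching, a run like this spends much of its wall time in repeated passwd/group lookups; with it, the lookups collapse to one per unique id.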
- Issue: gh-121267
Can the cache get outdated during execution?
@nineteendo Yes. Suppose the passwd database changes during processing of a tar file. We have two options:
1. Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.
2. Keep the same uname as when the UID was first encountered, ignore the new change.
Python currently does (1), which comes at a steep (several-hundred-percent) performance cost, since we must re-read and parse the passwd/group database for every single file. Note also that we first stat the file and then read passwd, so there is already a race condition in the code if we want to view it that way.
Doing (2) is much faster as I've shown. It's also what GNU tar does, so it's not a new idea.
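Option (2) amounts to memoizing the UID->uname and GID->gname lookups so each id is resolved at most once. A minimal sketch of the idea (function names are invented; a process-wide cache like this is the naive form that the discussion goes on to refine):

```python
import functools

try:
    import grp
    import pwd
except ImportError:  # non-Unix platforms have no passwd/group database
    pwd = grp = None

@functools.lru_cache(maxsize=None)
def uname_for(uid):
    """Resolve a UID to a user name once; later calls hit the cache."""
    if pwd is None:
        return ""
    try:
        return pwd.getpwuid(uid)[0]
    except KeyError:
        return ""

@functools.lru_cache(maxsize=None)
def gname_for(gid):
    """Resolve a GID to a group name once; later calls hit the cache."""
    if grp is None:
        return ""
    try:
        return grp.getgrgid(gid)[0]
    except KeyError:
        return ""
```

The cache means a passwd change mid-run is not picked up, which is exactly the GNU-tar-style trade-off described above.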
Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.
But you are doing the cache at class level, right? That means the inconsistent mapping might not be confined to a "single" tar file. In a long-running process, the user could generate a tar file, then change the mapping, and try to generate another one; but the name is cached and the change won't be reflected.
Yeah, that's what I was thinking too. Would a new parameter be better?
def __init__(self, name=None, mode="r", fileobj=None, format=None,
tarinfo=None, dereference=None, ignore_zeros=None, encoding=None,
errors="surrogateescape", pax_headers=None, debug=None,
- errorlevel=None, copybufsize=None, stream=False):
+ errorlevel=None, copybufsize=None, stream=False, uname_cache=None,
+ gname_cache=None):
But you are doing the cache at class level right?
I'm sorry, that was a mistake. I intended the cache to be per TarFile instance, to make the lifetime of the cache clearly limited. I'd be happy to correct it. I agree that a process-global cache isn't a good idea.
@nineteendo I'm not sure I follow you. Do you mean to make the caching opt-in? I was hoping that this could be turned on by default, so more code can benefit from the speedup.
Nevermind, just put it on the instance (there's already a cache for inodes):
https://github.com/python/cpython/blob/41397cee70f4933b539ab1a9232f04419ac8e2bb/Lib/tarfile.py#L1724-L1732
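The per-instance approach, mirroring the existing inodes dict linked above, could look roughly like this (the class and method names are invented for illustration; this is not the merged patch):

```python
try:
    import grp
    import pwd
except ImportError:  # non-Unix platforms have no passwd/group database
    pwd = grp = None

class NameCachingArchiver:
    """Toy stand-in for TarFile that caches name lookups per instance."""

    def __init__(self):
        # Cache lifetime is tied to this instance, so a long-running
        # process that opens a new archive sees fresh passwd/group data.
        self._uname_cache = {}  # uid -> uname, filled lazily
        self._gname_cache = {}  # gid -> gname, filled lazily

    def uname(self, uid):
        if uid not in self._uname_cache:
            try:
                self._uname_cache[uid] = pwd.getpwuid(uid)[0] if pwd else ""
            except KeyError:
                self._uname_cache[uid] = ""
        return self._uname_cache[uid]

    def gname(self, gid):
        if gid not in self._gname_cache:
            try:
                self._gname_cache[gid] = grp.getgrgid(gid)[0] if grp else ""
            except KeyError:
                self._gname_cache[gid] = ""
        return self._gname_cache[gid]
```

A fresh instance starts with empty caches, so the stale-mapping concern is limited to a single archive's lifetime.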
@nineteendo @gaogaotiantian Thanks for your feedback! I have pushed a fixed version now.
Next time, please do not force-push; it makes it easier for us to review the code and its history.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
And if you don't make the requested changes, you will be put in the comfy chair!
@gaogaotiantian: I have made the requested changes; please review again
Thanks for making the requested changes!
@gaogaotiantian: please review the changes made to this pull request.
@ethanfurman is the component owner of tar related stuff. I'll leave the decision for him. The code looks good to me, but I'm not sure if there is any concern specifically for the optimization.
Thanks @picnixz, I have applied your suggestions.
@ethanfurman, would you care to have a look at the patch? I think this performance gain would be quite beneficial for users of "tarfile".
@picnixz @gaogaotiantian @ethanfurman My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along? Thanks for the help.
My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along
I personally have no commit rights, and patches may stay pending for a long time until a core dev accepts them. I'm not a tarfile expert, so I think you'll need to wait until Ethan has time to review it.
@picnixz Thanks. I wasn't sure if people were waiting for me to do something at this point.
Hello, not a core dev here.
I just noticed too that the performance of tarfile is horrible, and found this ticket.
Commercial profiler run, using tarfile to tar the venv:
96 seconds to do the tarfile.add(), one third of which is spent looking up user/group info.
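Those numbers come from the commenter's own profiler run; one way to check the same thing locally with just the stdlib is to profile tarfile.add() and look for pwd.getpwuid / grp.getgrgid near the top (the tree built here is a small stand-in, not a real venv):

```python
import cProfile
import io
import pstats
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    for i in range(100):
        (root / f"f{i}").write_bytes(b"data")

    buf = io.BytesIO()
    profiler = cProfile.Profile()
    profiler.enable()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.add(root, arcname="tree")  # the call being profiled
    profiler.disable()

    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    # stats.print_stats(10)  # uncomment to see the top offenders
```

On a pre-patch interpreter the name-lookup calls show up once per archived file; on a patched one, once per unique uid/gid.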
Thanks everyone for moving this along and sorry I missed it -- my wife was in the hospital when this first started.
Thanks for your help everyone! Ethan, I hope your wife is feeling better.
@hauntsaninja Should we backport the changes? (it may be nice and I don't think it breaks compatibility here)
Ethan, I hope your wife is feeling better
I hope so too!
While it would be nice, it feels more like an enhancement than a bug fix.
(And she is, thank you.)
We typically don't backport performance improvements. And hope things are well, Ethan!
It would be great to backport: it's a very small patch that makes tarring twice as fast.
I wouldn't be surprised if this patch alone could save a power plant's worth of energy from computers wasting time archiving stuff inefficiently. Do we really have to wait for Python 3.14 to land for the fix to be available?
I've had enough "simple" patches with unintended consequences that I am not willing to backport this one. If you would like to open a discourse thread about it and get consensus that this patch is fine to backport, I'll be happy to do so.