[TENTATIVE] Attempt to reduce number of HEAD calls during model warmup.
What does this PR do?
When loading a model with the various tools within transformers there are actually
a lot of duplicate HEAD calls that cost network time and duplicate usage unnecessarily.
For instance, when doing a simple pipeline(model="gpt2")
you're getting:
Fetching https://huggingface.co/gpt2/resolve/main/config.json
Downloading: 100%|████████████████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<00:00, 1.20MB/s]
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/pytorch_model.bin
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/vocab.json
----
Fetching https://huggingface.co/gpt2/resolve/main/merges.txt
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer.json
----
Fetching https://huggingface.co/gpt2/resolve/main/added_tokens.json
----
Fetching https://huggingface.co/gpt2/resolve/main/special_tokens_map.json
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
So you're doing a HEAD call 4 to 5 times on config.json and 3 times on tokenizer_config.json.
Each of these is a call to the Hub that loads resources which could be avoided.
@Xcid
In addition, it adds a lot of noise to our logs, since we get seemingly random multiple HEAD calls for the exact same code being run.
Fixing it "cleanly" is hard, since there are many code paths that load the various elements, and checking every single one of them is impractical.
The proposed fix is to simply introduce a timed_cache wrapper on top of the requests.head
function.
We can keep a very short TTL since it's only meant to reduce duplicates when the model is unlikely to have changed. We need to keep in mind Jupyter or long-lived sessions, so we need a TTL so that model updates can still be seen and downloaded.
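A minimal sketch of what such a timed_cache wrapper could look like (illustrative only, not the exact implementation in this PR; the 10s TTL matches the value discussed later in the thread):

import functools
import time

import requests


def timed_cache(ttl_seconds=10):
    """Cache a function's return value per argument set for ttl_seconds."""
    def decorator(func):
        cache = {}  # key -> (timestamp, result)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # repr() keeps unhashable kwargs (e.g. a headers dict) usable in the key.
            key = (args, tuple(sorted((k, repr(v)) for k, v in kwargs.items())))
            now = time.monotonic()
            hit = cache.get(key)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            result = func(*args, **kwargs)
            cache[key] = (now, result)
            return result

        return wrapper
    return decorator


# Wrapping requests.head so repeated HEAD calls on the same URL within the TTL
# reuse the first response instead of hitting the Hub again.
@timed_cache(ttl_seconds=10)
def cached_head(url, **kwargs):
    return requests.head(url, **kwargs)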
In addition to that, it seems each code path makes the HEAD call with a different user-agent, which (afaik) makes it harder to understand our users' usage.
This is a tentative PR, proposed to reduce redundant network calls. If this is considered the correct direction,
I will then add unit tests for this timed_cache function.
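The unit test could look roughly like this (assuming the hypothetical timed_cache decorator sketched above; it only checks the hit/expiry behaviour, not the HEAD call itself):

import time


def test_timed_cache_reuses_result_within_ttl():
    calls = []

    @timed_cache(ttl_seconds=0.2)
    def fake_head(url):
        calls.append(url)
        return f"response-for-{url}"

    url = "https://huggingface.co/gpt2/resolve/main/config.json"

    # Two calls within the TTL: the underlying function runs only once.
    assert fake_head(url) == f"response-for-{url}"
    fake_head(url)
    assert len(calls) == 1

    # After the TTL expires, the call goes through again.
    time.sleep(0.25)
    fake_head(url)
    assert len(calls) == 2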
After the PR:
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/pytorch_model.bin
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/vocab.json
----
Fetching https://huggingface.co/gpt2/resolve/main/merges.txt
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer.json
----
Fetching https://huggingface.co/gpt2/resolve/main/added_tokens.json
----
Fetching https://huggingface.co/gpt2/resolve/main/special_tokens_map.json
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@sgugger
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Ok, marking as draft while the other PR is being worked on.
@sgugger If you want to take a look at the tests.
Right now the tests are failing since the cache was written for the previous HEAD code.
We can do the cache here or in huggingface_hub; I am not sure which is the most appropriate.
This PR now holds a few (relatively hackish) attempts to preserve information from the user-agent.
The one thing that surprised me is that the pipeline loads the model with from_auto_class/False, because it looks directly within the config to fetch the model with the correct head. While it's technically correct, I am not sure if that's "intentional" or there for telemetry purposes, since it actually uses a lot of magic.
WDYT?
The caching part should go more into huggingface_hub IMO, especially now that we rely on it for everything. But I also think people might have strong opinions on it (if a user just updated a model and doesn't see it downloaded, for instance, they'll be mad and won't necessarily understand there is some cache to clear).
I'll wait for your work on the cache; that should remove the need for it, and the tests here should cover it. I'll update this PR to make sure the user-agent is correct afterwards.
I had a TTL of 10s for the cache, which is largely enough in most cases (well, you would still get multiple hits if you were actually downloading the files, but...).
Removing the need for the cache is the best solution, so let's go with that for now.
@Narsil in case it's relevant, I was observing a significant perf disparity with the following repro (all files have been cached) depending on whether or not I specify TRANSFORMERS_OFFLINE=1:
test.py:
from transformers import pipeline
nlp = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
real 0m1.759s
user 0m1.815s
sys 0m2.443s
$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
real 0m6.187s
user 0m2.184s
sys 0m2.248s
I then searched around and stumbled upon your PR. I tried patching it into a venv linked to my repo and saw that the time was roughly the same:
(venv) ankur-m1:~/projects/layoutlm ankur$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
real 0m6.034s
user 0m2.019s
sys 0m2.346s
I may have completely misunderstood the purpose of this change, so please ignore me if this comment is irrelevant, but since I was poking around with perf in a similar fashion, I thought I'd share if helpful!
@ankrgyl
This PR caches the HEAD check for a very short time (10s), because most of the time users will want new model versions as soon as they exist.
You can try to increase the TTL if you want. Also, the cache isn't kept across different Python sessions.
You may want to check where the network overhead is occurring too; it could be DNS issues on your end, or just high latency to the HF servers.
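For instance, timing a single HEAD call with plain requests (nothing transformers-specific) gives a rough idea of whether the per-request latency itself is the problem:

import time

import requests

url = "https://huggingface.co/distilbert-base-cased-distilled-squad/resolve/main/config.json"
start = time.monotonic()
response = requests.head(url, allow_redirects=True, timeout=10)
print(f"HEAD {url} -> {response.status_code} in {time.monotonic() - start:.2f}s")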
Ahh, okay, that makes sense. Let me dig around a bit further with my repro, and see if I can find anything useful. Thanks for the pointers.
Btw, @sgugger is working on a better fix; we should reduce the number of network calls to as close to 1 as possible.
@sgugger if it's helpful to have a guinea pig test your PR in the wild, I'm happy to help! For background context, the reason I'm trying to optimize this cold start is that I'm trying to use transformers in a command line utility where cold start time matters quite a bit.
The PR is #18534, but nothing will beat using offline mode with the model cached, since you are then doing 0 calls to the API.
Thanks for the pointer @sgugger. I agree; however, the disparity exists even if you pin the revision:
$ cat transformers_test.py
from transformers import pipeline
nlp = pipeline("question-answering", model='distilbert-base-cased-distilled-squad', revision="1b9d42b637aed70c9f3cd27e13b66ee9f847ed03")
$ time python transformers_test.py
real 0m5.680s
user 0m1.694s
sys 0m2.252s
$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py
real 0m1.321s
user 0m1.653s
sys 0m1.997s
While playing around with this, I noticed issue #18537, which might be leading to extra network calls (since the revision isn't pinned for the model) in my repro.
Apologies if I'm missing something obvious here, but I'd expect that (a) if the revision is specified and (b) it's cached, then there shouldn't be any network calls.
If the revision is specified as a commit sha, then yes, the cache should be used. This is not implemented by #18534, but it could be some follow-up work.
The only exception is for files that don't exist, as we don't know if they haven't been downloaded yet or if they are not present. That's why we still have extra calls in #18534 and would still have extra calls in this case as well.
Got it, I pulled down your PR and ran the same test, and saw much better results:
$ time python transformers_test.py
real 0m2.384s
user 0m1.869s
sys 0m2.222s
$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py
real 0m1.588s
user 0m1.722s
sys 0m2.229s
I'd be happy to help with the follow up work/exploration if helpful. I think you could theoretically handle the "all files downloaded" case too, by caching a file that simply marks that you've downloaded all files associated with a revision.
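For illustration, a rough sketch of that marker idea (the sentinel file name and helper functions are purely hypothetical, not an existing huggingface_hub convention):

from pathlib import Path


def mark_revision_complete(cache_dir: str, revision: str) -> None:
    # Drop a sentinel once every file for a pinned revision has been fetched.
    Path(cache_dir, f".complete-{revision}").touch()


def revision_is_complete(cache_dir: str, revision: str) -> bool:
    # If the sentinel exists, everything for this revision is local
    # and no HEAD call is needed.
    return Path(cache_dir, f".complete-{revision}").exists()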
Yes there is probably some way to cache that the file does not exist for a given commit sha. Pinged a few people internally to see if they like that idea and will let you know when I hear back!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale. I'll come back to this at some point.
There is only one HEAD call now once the model is cached @Narsil
You're right, closing.