
It keeps downloading the models again and again when starting a new operation.

vishwas31 opened this issue 3 years ago • 17 comments

vishwas31 avatar Jul 29 '21 09:07 vishwas31

Not sure what you mean? What is "a new operation"? What does your code look like?

Models should be downloaded once and cached on disc.

nreimers avatar Jul 29 '21 13:07 nreimers

The thing is, I ran the code once and it downloaded the model, but when I run the same lines of code again, it downloads the model again. I run this in a Jupyter notebook and, to be clear, I do not restart my kernel.

from easynmt import EasyNMT
model = EasyNMT('opus-mt', cache_folder= "/home/Path/.cache/huggingface/transformers")
res = model.translate(name, target_lang='en')

vishwas31 avatar Jul 29 '21 13:07 vishwas31

Have you tried to set it to a different cache_folder or to not pass the cache_folder parameter?

nreimers avatar Jul 29 '21 14:07 nreimers

I tried, but it didn't help.

vishwas31 avatar Jul 29 '21 14:07 vishwas31

What is "name" here? Can you post an example? Is it the exact same input both times?

Note that the opus-mt model is actually many different models, one for each language direction. Maybe a different opus-mt model was downloaded for a different language direction?
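
For example, Korean→English and Japanese→English use two separate checkpoints (Helsinki-NLP/opus-mt-ko-en and Helsinki-NLP/opus-mt-ja-en on the Hugging Face Hub), each downloaded once on first use. A minimal sketch, assuming those two directions:

from easynmt import EasyNMT

model = EasyNMT('opus-mt')
# The first ko->en call downloads Helsinki-NLP/opus-mt-ko-en, the first ja->en
# call downloads Helsinki-NLP/opus-mt-ja-en; later calls should reuse the cached files
print(model.translate("안녕하세요", source_lang='ko', target_lang='en'))
print(model.translate("こんにちは", source_lang='ja', target_lang='en'))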

nreimers avatar Jul 29 '21 14:07 nreimers

"name" is basically a text which I want to translate. I have a csv of text which are in different languages. Yes..the same exact inputs. My data has mostly Korean, Japanese, Chinese, Russian, Spanish texts. And everytime it downloads all the models again and again.

vishwas31 avatar Jul 29 '21 14:07 vishwas31

I can confirm that this seems somewhat broken with opus-mt models; the cache isn't working.

R4ZZ3 avatar Aug 04 '21 13:08 R4ZZ3

Same issue; it downloads every time, not just the first time.

zubairahmed-ai avatar Aug 05 '21 06:08 zubairahmed-ai

@vishwas31 @R4ZZ3 @zubairahmed-ai

I'm sadly not able to re-produce the error. Model is downloaded only once: https://colab.research.google.com/drive/1RgsdOylqV2aYKuKNRXWElqU7wePony5w?usp=sharing

Could you share some self-contained code demonstrating the issue?

nreimers avatar Aug 05 '21 09:08 nreimers

Have you tried to do this on a local system? Demonstrating it would mean I have to record a video.

vishwas31 avatar Aug 10 '21 09:08 vishwas31

cache_folder only caches "easynmt.json"; you should modify this JSON file and specify your local model path like this:

{ "model_class": "easynmt.models.AutoModel.AutoModel", "model_args": { "model_name": "/content/drive/MyDrive/Kaggle/modelcache/m2m100_418M", "tokenizer_args": {"model_max_length": 1024} }, "lang_pairs": ["ru-id", "ru-ms", "id-ru", "ms-ru"] }

By the way, you should clone the full model files first. On Colab, you can try this:

!apt-get install git-lfs
!git lfs install
!git clone https://huggingface.co/facebook/m2m100_418M

Finally, you can load your local model like this:

easynmt.json is at "/content/drive/MyDrive/Kaggle/modelcache/m2m100_418m/easynmt.json"; please modify this JSON file to load your local model first.

model = EasyNMT('m2m_100_418M', cache_folder= "/content/drive/MyDrive/Kaggle/modelcache")
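
Once easynmt.json points at the local files, translation should run without re-downloading. A quick check (the ru→id direction follows the lang_pairs listed in the JSON above; the sentence is just a placeholder):

# Should use the locally cloned m2m100_418M files instead of downloading
print(model.translate("Привет, мир", source_lang='ru', target_lang='id'))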

js-lan avatar Aug 30 '21 12:08 js-lan

I also have this problem while translating from es-en.

Simantakaushik avatar Dec 09 '21 03:12 Simantakaushik

Figured out the problem. This happens if the download does not complete within the Gunicorn TIMEOUT period. Set your TIMEOUT variable to something large, like 10 minutes (600 seconds), so the ~300 MB model file can be downloaded and installed within that timeframe.
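
For example, if the service runs behind Gunicorn, the timeout can be raised in the config file (a minimal sketch; gunicorn.conf.py is Gunicorn's default config filename and the worker count is just a placeholder):

# gunicorn.conf.py
timeout = 600   # allow up to 10 minutes so the ~300 MB model download can finish
workers = 2     # placeholder worker count

The same effect can be achieved on the command line with gunicorn --timeout 600.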

Simantakaushik avatar Dec 09 '21 06:12 Simantakaushik


I encountered the same problem. When translating a set of documents from various languages to English using the 'opus-mt' model in a loop, the machine keeps downloading new models to continue.

For demonstration, I extended @nreimers' example with a few more commands: https://colab.research.google.com/drive/1pkcBEsX3OHA1LM52oO2LhKZyVJgMWmAO?usp=sharing

Is there a more efficient way to do the translation in batches (from any language to English) without repeatedly downloading models and occupying too much memory?

edmangog avatar Dec 11 '21 10:12 edmangog

Hi @edmangog, opus-mt uses a different model for every language direction. So if you translate from 10 languages to English, you must use 10 different models.

Models are downloaded once and then stored on disc. When you create the opus-mt translation object, you can specify how many models to keep in memory by changing max_loaded_models. By default, it keeps the 10 most recent models in memory.

If you pass a list of sentences to translate, sentences are grouped by their language to minimize loading of models.
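
A minimal sketch of that usage (the example sentences are just placeholders):

from easynmt import EasyNMT

# Keep up to 10 direction-specific opus-mt models in memory at once
model = EasyNMT('opus-mt', max_loaded_models=10)

# Passing a list lets EasyNMT group the sentences by detected source language,
# so each direction's model is loaded only once for the whole batch
sentences = ["안녕하세요", "こんにちは", "Hola mundo"]
print(model.translate(sentences, target_lang='en'))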

nreimers avatar Dec 11 '21 10:12 nreimers

Thanks, @nreimers!

How does the above-mentioned grouping work programmatically? I can't seem to find the relevant info in this GitHub repo.

Actually, in my use case there are thousands of documents in 40+ languages and I want to translate them all into English. Let's say max_loaded_models = 10. Do I need to first manually group the documents by language and rank the groups by size before passing them in, in order to minimize how often models are reloaded?

edmangog avatar Dec 11 '21 11:12 edmangog

Have a look here: https://github.com/UKPLab/EasyNMT/blob/5ea48f5fb68be9e4be4b8096800e32b8ad9a45df/easynmt/EasyNMT.py#L135

You can pass all documents to encode().

Otherwise, to be sure, you can also first run language detection (model.language_detection(text)) and then do the grouping by yourself.
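
A rough sketch of that manual grouping (the document list and the sorting by group size are placeholders/assumptions, not part of the library):

from collections import defaultdict
from easynmt import EasyNMT

model = EasyNMT('opus-mt', max_loaded_models=10)
docs = ["..."]  # your documents in 40+ languages

# Group documents by detected source language, then translate the largest
# groups first so each direction's model is loaded at most once
groups = defaultdict(list)
for doc in docs:
    groups[model.language_detection(doc)].append(doc)

translations = {}
for lang, texts in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    translations[lang] = model.translate(texts, source_lang=lang, target_lang='en')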

nreimers avatar Dec 13 '21 08:12 nreimers