
Please report location when HuggingFaceTokenizer.newInstance fails with I/O error

Open wnm3 opened this issue 1 year ago • 7 comments

I’m using code like:

    static String DJL_MODEL = "intfloat/multilingual-e5-base";
    static String DJL_PATH = "djl://ai.djl.huggingface.pytorch/" + DJL_MODEL;
    static private HuggingFaceTokenizer huggingFaceTokenizer;
...
    static private HuggingFaceTokenizer getHuggingFaceTokenizer() {
        if (huggingFaceTokenizer == null) {
            huggingFaceTokenizer = HuggingFaceTokenizer.newInstance(DJL_MODEL,
                getDJLConfig());
        }
        return huggingFaceTokenizer;
    }

and it works fine running from the command line. However, when executing the same code in a Docker container I am getting an error:

    I/O error Permission denied (os error 13)
    RuntimeException: I/O error Permission denied (os error 13)
    ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizer(Native Method)
    ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.newInstance(HuggingFaceTokenizer.java:109)

I’m sure the issue has to do with permissions in the container, but I have no idea what happens when loading the instance, so I don’t know which directory needs its permissions changed. Unfortunately, the method resolves to a native library, so I can’t figure out what is causing this error.

Please let me know what it is attempting to retrieve / save. The same code works fine in an OpenLiberty server running outside the container.

wnm3 avatar Feb 14 '24 22:02 wnm3

@wnm3 When you use "intfloat/multilingual-e5-base", you are actually downloading the tokenizer from the Huggingface Hub. It will save the tokenizer.json file in the $HUGGINGFACE_HUB_CACHE directory (default: ~/.cache/huggingface/hub). See: https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache

Please make sure the cache folder is writable. You can also manually set the env var to point it to a desired directory. See this example: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/Dockerfile#L44
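One way to see which directory the native tokenizer will try to write to is to probe it from Java before calling newInstance. This is only a diagnostic sketch; the env-var precedence here is an assumption based on the Huggingface cache docs, not the native library's actual logic:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class CacheCheck {
    // Resolve the hub cache dir: HF_HUB_CACHE, then the older
    // HUGGINGFACE_HUB_CACHE, then the documented default
    // ~/.cache/huggingface/hub. (Assumed precedence.)
    static Path resolveHubCache(Map<String, String> env, String userHome) {
        String dir = env.get("HF_HUB_CACHE");
        if (dir == null) {
            dir = env.get("HUGGINGFACE_HUB_CACHE");
        }
        if (dir == null) {
            dir = userHome + "/.cache/huggingface/hub";
        }
        return Paths.get(dir);
    }

    public static void main(String[] args) throws IOException {
        Path cache = resolveHubCache(System.getenv(), System.getProperty("user.home"));
        Files.createDirectories(cache);                            // throws with a readable path if not creatable
        Path probe = Files.createTempFile(cache, "probe", ".tmp"); // throws if not writable
        Files.delete(probe);
        System.out.println("cache is writable: " + cache);
    }
}
```

Running this inside the container should surface the same "Permission denied" as a Java exception that names the offending path.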

frankfliu avatar Feb 14 '24 22:02 frankfliu

I noticed you are using the DJL model zoo; is there any reason you create your tokenizer manually?

If you use our built-in TranslatorFactory, we will load the tokenizer from the model directory instead of downloading it from the Huggingface hub. See: https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/main/java/ai/djl/huggingface/translator/TextEmbeddingTranslatorFactory.java#L60-L64
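A minimal sketch of that approach, adapted from the DJL text-embedding demo (it assumes the DJL tokenizers extension and a PyTorch engine are on the classpath; the model URL is the one from the snippet above):

```java
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class E5Embedding {
    public static void main(String[] args) throws Exception {
        // The djl:// URL loads the model from the DJL model zoo; the bundled
        // TranslatorFactory reads tokenizer.json from the model directory,
        // so nothing needs to be fetched into the Huggingface hub cache.
        Criteria<String, float[]> criteria =
                Criteria.builder()
                        .setTypes(String.class, float[].class)
                        .optModelUrls("djl://ai.djl.huggingface.pytorch/intfloat/multilingual-e5-base")
                        .build();
        try (ZooModel<String, float[]> model = criteria.loadModel();
             Predictor<String, float[]> predictor = model.newPredictor()) {
            float[] embedding = predictor.predict("query: how do I load a tokenizer?");
            System.out.println("embedding dimensions: " + embedding.length);
        }
    }
}
```

Note this still downloads the model artifact itself on first use, but into DJL's own cache rather than the Huggingface hub cache.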

frankfliu avatar Feb 14 '24 22:02 frankfliu

I'd been handed code that worked, so I didn't investigate alternatives ;^) -- the API you showed has more parameters. I'm happy to switch if you could provide a working example, as I'm not familiar with the input/output or the model, since we are passing a single string and a map of options. Thanks in advance.

wnm3 avatar Feb 15 '24 13:02 wnm3

I tried setting the env variables (the article showed HF_HUB_CACHE rather than HUGGINGFACE_HUB_CACHE) to a directory with rwx at every level and still get the I/O error -- do I need to create subdirectories with all permissions? Still using the download...

    default@37e5f86e600b:/opt/ol/wlp/output/defaultServer$ pwd
    /opt/ol/wlp/output/defaultServer
    default@37e5f86e600b:/opt/ol/wlp/output/defaultServer$ ls -l | grep data
    drwxrwxrwx. 44    1000 1000       8192 Feb 14 06:33 data
    default@37e5f86e600b:/opt/ol/wlp/output/defaultServer$ env | grep HUB
    HF_HUB_CACHE=/opt/ol/wlp/output/defaultServer/data
    HUGGINGFACE_HUB_CACHE=/opt/ol/wlp/output/defaultServer/data

wnm3 avatar Feb 15 '24 13:02 wnm3

Here is an example that takes a String as input and outputs a float[]: https://github.com/deepjavalibrary/djl-demo/blob/master/huggingface/nlp/src/main/java/com/examples/TextEmbedding.java

You can set tokenizer parameters like this: https://github.com/deepjavalibrary/djl/blob/64c1b969feeafc19cc9bc8c7f4cc2e6f46fccce6/extensions/tokenizers/src/test/java/ai/djl/huggingface/tokenizers/TextEmbeddingTranslatorTest.java#L207

The huggingface tokenizer source code is here: https://github.com/huggingface/tokenizers/tree/main/tokenizers

frankfliu avatar Feb 15 '24 15:02 frankfliu

Will the newInstance API I first asked about check the .cache for the existence of the model before attempting to download it? If so, I can build the container with the model set up in ~/.cache...

wnm3 avatar Feb 15 '24 15:02 wnm3

I don't recommend setting up the ~/.cache directory yourself. It's quite complicated, and HF can change their folder structure in the future.

Setting HF_HUB_CACHE should be the right solution. Can you check whether the folder gets created if you point HF_HUB_CACHE at a new folder?

frankfliu avatar Feb 15 '24 16:02 frankfliu

@frankfliu thanks for your suggestions. I'm stuck because newInstance for intfloat/multilingual-e5-base calls JNI code that attempts to save downloaded content to ~/.cache/huggingface/hub, and I've tried environment variables to point it elsewhere, but neither HF_HUB_CACHE nor HUGGINGFACE_HUB_CACHE seems to work. If the JNI code is on GitHub, I can try reading it to find what controls this location. I'm attempting to load the e5 model in a Docker container, and it fails with permission issues for the home directory. Feel free to reach out directly if desired: [email protected] / [email protected]

wnm3 avatar Feb 23 '24 15:02 wnm3

The env variable I needed was HF_HOME -- it was in the Docker document you'd provided. Thank you.

wnm3 avatar Feb 23 '24 16:02 wnm3

It's a bit confusing.

Based on the Huggingface website, HUGGINGFACE_HUB_CACHE is the old env var; Huggingface has changed it to HF_HUB_CACHE.

However, I checked their Rust implementation, and it seems only HF_HOME is implemented in their Rust library: https://github.com/huggingface/hf-hub/blob/main/src/lib.rs#L195
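If that reading of the Rust source is right, the effective resolution would be: use $HF_HOME if set, otherwise ~/.cache/huggingface, and put the hub cache in a hub/ subdirectory. A small sketch of that inferred logic (not the library's actual code):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class HfHome {
    // Inferred from hf-hub's Rust source: $HF_HOME wins, falling back to
    // ~/.cache/huggingface; the hub cache then lives under "<home>/hub".
    static Path hubCache(Map<String, String> env, String userHome) {
        String hfHome = env.get("HF_HOME");
        Path home = hfHome != null
                ? Paths.get(hfHome)
                : Paths.get(userHome, ".cache", "huggingface");
        return home.resolve("hub");
    }

    public static void main(String[] args) {
        System.out.println(hubCache(System.getenv(), System.getProperty("user.home")));
    }
}
```

This would explain why setting HF_HUB_CACHE alone had no effect on the native tokenizer while HF_HOME did.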

frankfliu avatar Feb 23 '24 17:02 frankfliu

Thank you Frank. Ideally, the exception thrown could identify where it was attempting to write, and possibly reference the env var's name ;^) I'll close this issue.

wnm3 avatar Feb 23 '24 17:02 wnm3