
Add Pretrained PyTorch TorchVision Models

Open StanHatko opened this issue 3 years ago • 25 comments

I made a pull request https://github.com/StatCan/aaw-kubeflow-containers/pull/302 to add pretrained PyTorch models to the jupyterlab-pytorch image. Please let me know if you have any comments or if there are any problems that need to be fixed.

StanHatko avatar Dec 09 '21 22:12 StanHatko

@blairdrummond

StanHatko avatar Dec 09 '21 22:12 StanHatko

Here is the list of URLs to be mirrored. An alternative method is to simply wget these into the proper directory. The URLs are:

  • https://download.pytorch.org/models/regnet_y_400mf-c65dace8.pth
  • https://download.pytorch.org/models/regnet_y_800mf-1b27b58c.pth
  • https://download.pytorch.org/models/regnet_y_1_6gf-b11a554e.pth
  • https://download.pytorch.org/models/regnet_y_3_2gf-b5a9779c.pth
  • https://download.pytorch.org/models/regnet_y_8gf-d0d0e4a8.pth
  • https://download.pytorch.org/models/regnet_y_16gf-9e6ed7dd.pth
  • https://download.pytorch.org/models/regnet_y_32gf-4dee3f7a.pth
  • https://download.pytorch.org/models/regnet_x_400mf-adf1edd5.pth
  • https://download.pytorch.org/models/regnet_x_800mf-ad17e45c.pth
  • https://download.pytorch.org/models/regnet_x_1_6gf-e3633e7f.pth
  • https://download.pytorch.org/models/regnet_x_3_2gf-f342aeae.pth
  • https://download.pytorch.org/models/regnet_x_8gf-03ceed89.pth
  • https://download.pytorch.org/models/regnet_x_16gf-2007eb11.pth
  • https://download.pytorch.org/models/regnet_x_32gf-9d47f8d0.pth
  • https://download.pytorch.org/models/resnet18-f37072fd.pth
  • https://download.pytorch.org/models/resnet34-b627a593.pth
  • https://download.pytorch.org/models/resnet50-0676ba61.pth
  • https://download.pytorch.org/models/resnet101-63fe2227.pth
  • https://download.pytorch.org/models/resnet152-394f9c45.pth
  • https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth
  • https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth
  • https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth
  • https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth
  • https://download.pytorch.org/models/shufflenetv2_x0.5-f707e7126e.pth
  • https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth
  • https://download.pytorch.org/models/squeezenet1_0-b66bff10.pth
  • https://download.pytorch.org/models/squeezenet1_1-b8a52dc0.pth
  • https://download.pytorch.org/models/vgg11-8a719046.pth
  • https://download.pytorch.org/models/vgg13-19584684.pth
  • https://download.pytorch.org/models/vgg16-397923af.pth
  • https://download.pytorch.org/models/vgg19-dcbb9e9d.pth
  • https://download.pytorch.org/models/vgg11_bn-6002323d.pth
  • https://download.pytorch.org/models/vgg13_bn-abd245e5.pth
  • https://download.pytorch.org/models/vgg16_bn-6c64b313.pth
  • https://download.pytorch.org/models/vgg19_bn-c79401a0.pth

StanHatko avatar Dec 10 '21 17:12 StanHatko

@StanHatko I have another idea which might be interesting; I could see us having a MinIO bucket or something within the cluster specifically for storing/caching files like this, so that downloads would be very fast.

Typically it's good to keep the Docker images small, as that affects boot time and other things, but maybe an in-cluster mirror would be useful?

@brendangadd you have any thoughts on this kind of caching?

blairdrummond avatar Dec 10 '21 19:12 blairdrummond

That could work. One disadvantage is that the URLs are hardcoded in the PyTorch package (see for example https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py near the top). So users would have to configure downloads of pretrained models manually (instead of the package automatically downloading a requested model that is not found on disk, as it would on a regular computer), unless there is a way to override those URLs, or to intercept such requests and redirect them to the Artifactory or something else.

StanHatko avatar Dec 10 '21 20:12 StanHatko

@brendangadd if we want to try some crazy stuff, I think EnvoyFilters can do interception at that level

https://istio.io/latest/docs/reference/config/networking/envoy-filter/

blairdrummond avatar Dec 10 '21 20:12 blairdrummond

From https://pytorch.org/vision/master/models.html it says that the TORCH_HOME environment variable can be set to specify the cache directory.

Instancing a pre-trained model will download its weights to a cache directory. This directory can be set using the TORCH_HOME environment variable. See torch.hub.load_state_dict_from_url() for details.

From https://pytorch.org/docs/stable/hub.html#torch.hub.load_state_dict_from_url it includes a parameter to remap storage locations.

map_location (optional) – a function or a dict specifying how to remap storage locations (see torch.load)

The documentation for torch.load says:

If map_location is a callable, it will be called once for each serialized storage with two arguments: storage and location. The storage argument will be the initial deserialization of the storage, residing on the CPU. Each serialized storage has a location tag associated with it which identifies the device it was saved from, and this tag is the second argument passed to map_location. The builtin location tags are 'cpu' for CPU tensors and 'cuda:device_id' (e.g. 'cuda:2') for CUDA tensors. map_location should return either None or a storage. If map_location returns a storage, it will be used as the final deserialized object, already moved to the right device. Otherwise, torch.load() will fall back to the default behavior, as if map_location wasn’t specified.

If TORCH_HOME can be pointed at a fast read-only SSD storage accessible from all nodes it might do the job.
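
As a stdlib-only sketch of how that would behave, the snippet below replicates torch.hub's cache-directory lookup (TORCH_HOME if set, else $XDG_CACHE_HOME/torch, else ~/.cache/torch, with hub/checkpoints appended); the shared mount path is hypothetical:

```python
import os
from pathlib import Path

def torch_cache_dir() -> Path:
    """Resolve the checkpoint directory the way torch.hub does:
    $TORCH_HOME if set, otherwise $XDG_CACHE_HOME/torch, otherwise
    ~/.cache/torch, with hub/checkpoints appended."""
    torch_home = os.environ.get("TORCH_HOME")
    if torch_home:
        base = Path(torch_home)
    else:
        xdg = os.environ.get("XDG_CACHE_HOME")
        base = (Path(xdg) if xdg else Path.home() / ".cache") / "torch"
    return base / "hub" / "checkpoints"

# Pointing TORCH_HOME at a shared read-only mount (hypothetical path)
# means weights pre-seeded there are found without any network access.
os.environ["TORCH_HOME"] = "/shared/pretrained-models"
print(torch_cache_dir())  # /shared/pretrained-models/hub/checkpoints
```

So pre-seeding that directory with the mirrored .pth files from the list above would make `pretrained=True` work offline, as long as the filenames (including the hash suffixes) are preserved.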

StanHatko avatar Dec 10 '21 20:12 StanHatko

Anyway, the problem with these solutions (creating fast read-only storage accessible from all nodes and pointing TORCH_HOME there, or intercepting URLs with EnvoyFilters) is that only the AAW administrators can implement them; I cannot do it myself.

StanHatko avatar Dec 10 '21 20:12 StanHatko

If we are OK with the AAW being different from home computers in this regard, the simplest solution may be to mirror these pretrained model URLs in the Artifactory and clearly document that for the pretrained models (both torchvision and others like word embeddings). Then we could have a small script in the image to download requested pretrained models, for example:

download-torchvision-model.sh resnet18

This script download-torchvision-model.sh will pull the requested torchvision model from the Artifactory and save it in the correct directory that torchvision checks.
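
The logic of that proposed helper could look roughly like the following (shown in Python rather than shell for clarity; the Artifactory base URL and the model table are assumptions, not a decided layout):

```python
import os
import urllib.request
from pathlib import Path

# Hypothetical mirror base; the real Artifactory layout is not decided yet.
ARTIFACTORY_BASE = "https://artifactory.example/pretrained-packages/torchvision"

# torchvision verifies the hash embedded in the filename, so the full
# checkpoint name must be kept as-is. Only a few entries shown.
MODEL_FILES = {
    "resnet18": "resnet18-f37072fd.pth",
    "resnet50": "resnet50-0676ba61.pth",
    "vgg16": "vgg16-397923af.pth",
}

def download_torchvision_model(name, fetch=urllib.request.urlretrieve):
    """Fetch one mirrored checkpoint into the directory torchvision checks."""
    filename = MODEL_FILES[name]
    cache = Path(os.environ.get("TORCH_HOME", str(Path.home() / ".cache/torch")))
    dest = cache / "hub" / "checkpoints" / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(f"{ARTIFACTORY_BASE}/{filename}", str(dest))
    return dest
```

The `fetch` parameter is injected only so the sketch can be exercised without network access; a real script would just wget the URL.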

StanHatko avatar Dec 10 '21 20:12 StanHatko

To avoid greatly increasing the size of the image and long build times, I think the Artifactory approach is better. Please mirror the pretrained model URLs above in the Artifactory. Once that's done I can create a small script download-torchvision-model.sh described above, add that script to the image, and add it to the documentation.

StanHatko avatar Dec 14 '21 20:12 StanHatko

@bryanpaget this is another possibly cool idea, as it makes pre-trained models available for protected-b notebooks

blairdrummond avatar Jan 25 '22 19:01 blairdrummond

Another source of pretrained model weights we discussed yesterday was https://huggingface.co/models, also with a predefined list of acceptable models.

ToucheSir avatar Jan 26 '22 18:01 ToucheSir

We just need to gather a list of URLs to mirror in addition to the ones above (the huggingface.co site has 903 pages of models, but we can at least mirror the most common and important ones). I'll post some additional word embedding URLs below.

An Artifactory administrator simply needs to add these URLs to Artifactory. Once I have the URLs I can make the small script mentioned above; a better interface may be ./download-pretrained-model.sh torch-resnet18 or ./download-pretrained-model.sh fasttext-cc-fr. In the future, if there's a way to intercept URL downloads on AAW and redirect them to Artifactory, that would be even better, but for now ./download-pretrained-model.sh should be good enough.

StanHatko avatar Jan 27 '22 15:01 StanHatko

The following URLs are from existing GitLab issues.

From https://fasttext.cc/docs/en/english-vectors.html (contains information and reference paper to cite if publishing paper based on these):

  • https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
  • https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
  • https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
  • https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip

From https://fasttext.cc/docs/en/crawl-vectors.html (contains information and reference paper to cite if publishing paper based on these):

  • https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
  • https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.bin.gz

From https://fasttext.cc/docs/en/aligned-vectors.html (contains information and reference papers to cite if publishing paper based on these):

  • https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec
  • https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.fr.align.vec

FastText for language detection https://fasttext.cc/docs/en/language-identification.html needs the following:

  • https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
  • https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

Here are the URLs to mirror for the pretrained GloVe embeddings:

  • http://nlp.stanford.edu/data/glove.6B.zip (Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download))
  • http://nlp.stanford.edu/data/glove.42B.300d.zip (Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download))
  • http://nlp.stanford.edu/data/glove.840B.300d.zip (Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download))
  • http://nlp.stanford.edu/data/glove.twitter.27B.zip (Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download))

StanHatko avatar Jan 27 '22 15:01 StanHatko

Various models from HuggingFace that people previously requested (click on "Files and Versions" to see the actual files; there's probably a good way to git clone these, as that tab shows a git repo):

  • https://huggingface.co/bert-large-uncased
  • https://huggingface.co/camembert/camembert-large
  • https://huggingface.co/gpt
  • https://huggingface.co/gpt-medium
  • https://huggingface.co/gpt-large
  • https://huggingface.co/asahi417/tner-xlm-roberta-base-ontonotes5
  • https://huggingface.co/asahi417/tner-xlm-roberta-base-uncased-ontonotes5
  • https://huggingface.co/Davlan/bert-base-multilingual-cased-ner-hrl

StanHatko avatar Jan 27 '22 15:01 StanHatko

Yes, looks like the huggingface downloads can be done programmatically with git + git lfs.
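
A small sketch of what that mirroring step might look like, assembling the git + git-lfs commands for one model repo (the mirror_root path and command layout are assumptions, not a tested pipeline):

```python
from pathlib import Path

def hf_mirror_commands(model_id, mirror_root="/srv/mirror/huggingface"):
    """Return the git + git-lfs commands that would mirror one HuggingFace
    model repo; `git lfs pull` fetches the large weight files that plain
    `git clone` leaves as pointer files when only git-lfs pointers are kept."""
    # Flatten "org/name" ids into a single directory name.
    dest = Path(mirror_root) / model_id.replace("/", "__")
    return [
        f"git clone https://huggingface.co/{model_id} {dest}",
        f"git -C {dest} lfs pull",
    ]

for cmd in hf_mirror_commands("Davlan/bert-base-multilingual-cased-ner-hrl"):
    print(cmd)
```

An administrator (or CI job) would run the returned commands on a machine with internet access, then publish the resulting directories into the Artifactory folder.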

ToucheSir avatar Jan 27 '22 15:01 ToucheSir

This is good. Now we just need some way to contact the AAW Artifactory manager and ask them to mirror the above in a pretrained-packages folder in the Artifactory.

StanHatko avatar Jan 27 '22 16:01 StanHatko

What is the contact information for the AAW Artifactory manager?

bryanpaget avatar Jan 27 '22 16:01 bryanpaget

@blairdrummond Do you know how to contact the AAW Artifactory manager?

StanHatko avatar Jan 27 '22 19:01 StanHatko

@StanHatko @bryanpaget I think @Jose-Matsuda can log in. We are the owners of the AAW Artifactory. Once Bryan has his accounts we'll be able to look into this

blairdrummond avatar Jan 27 '22 19:01 blairdrummond

We learned today that a subset of weights from PyTorch Hub and HuggingFace's model list are already mirrored on an Artifactory instance, just not the one available on AAW. @EkramulHoque found an internal ticket about it too.

ToucheSir avatar Jan 27 '22 19:01 ToucheSir

@blairdrummond we have found some pre-trained transformer models already downloaded at this Artifactory on NetA.

https://artifactory.statcan.ca:8443/artifactory/webapp/#/artifacts/browse/tree/General/generic-local/transformers-model

will it be possible to make a copy of this for the AAW artifactory?

EkramulHoque avatar Jan 27 '22 19:01 EkramulHoque

I don't see why we would need to transfer data back from the Net A Artifactory to the AAW Artifactory. Wouldn't it be easier to just add the URLs to mirror on the AAW Artifactory? Artifactory was built specifically for this job: mirroring repositories and objects.

StanHatko avatar Jan 28 '22 15:01 StanHatko

As per discussion at today's technical elaboration (CC @bryanpaget @Jose-Matsuda )

  • Investigate the file types and security risk. .pth files are pickles, which are unsafe to load from untrusted sources; we need to assess the types of artifacts under discussion (.npy, .pth, .zip, etc.) and see what's involved.
  • Look at the licenses applicable to the model repos.
  • See who controls the endpoints and what the governance is. Is Facebook/PyTorch responsible for the official PyTorch models? Are any user-contributed? We will have a better time with trusted sources than with user-contributed (untrusted) ones.

We can investigate these and compile a list of "trusted" sources in this thread, hopefully. We will talk to our Artifactory rep about this, and we may talk to the upstream folks such as PyTorch or Huggingface.

blairdrummond avatar Feb 03 '22 20:02 blairdrummond

Another site that should be mirrored is https://cdn.proj.org/, which has additional geographic projection files and is queried automatically by GDAL when it encounters a projection not saved on the system. This will obviously fail on a system without internet access. The site provides mirroring instructions and states that the total size of its content is 568 MB.

I found this Dockerfile, https://github.com/bosborn/proj.4/blob/master/Dockerfile, that mirrors that site; specifically, it runs the following:

# Put this first as this is rarely changing
RUN \
    mkdir -p /usr/share/proj; \
    wget --no-verbose --mirror https://cdn.proj.org/; \
    rm -f cdn.proj.org/*.js; \
    rm -f cdn.proj.org/*.css; \
    mv cdn.proj.org/* /usr/share/proj/; \
    rmdir cdn.proj.org

StanHatko avatar Feb 08 '22 21:02 StanHatko

With GDAL installed in a conda virtual environment, it uses /etc/share/proj as the projections directory (e.g. /etc/share/proj/us_nga_egm96_15.tif), not /usr/share/proj (what the above example uses). /etc/share/proj is writable by the AAW user; I'm able to put projection files there, which then makes the corresponding projections usable by GDAL.

More generally (see https://proj.org/resource_files.html), on Linux PROJ uses ${XDG_DATA_HOME}/proj if XDG_DATA_HOME is defined, else ${HOME}/.local/share/proj. For me, in the conda virtual environment with GDAL installed, XDG_DATA_HOME is /etc/share/proj.
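
That lookup order can be reproduced in a few lines (stdlib only; the comment about the AAW conda environment reflects the observation above, not a guarantee):

```python
import os
from pathlib import Path

def proj_user_data_dir() -> Path:
    """PROJ's user-writable resource directory on Linux, per
    https://proj.org/resource_files.html: $XDG_DATA_HOME/proj when
    XDG_DATA_HOME is set, else ~/.local/share/proj."""
    xdg = os.environ.get("XDG_DATA_HOME")
    if xdg:
        return Path(xdg) / "proj"
    return Path.home() / ".local" / "share" / "proj"

# On AAW the conda environment reportedly sets XDG_DATA_HOME, so grid
# files dropped into the resulting directory become visible to GDAL/PROJ
# without network access.
print(proj_user_data_dir())
```

This is the directory a mirror-download script would need to target when fetching individual grids from an Artifactory copy of cdn.proj.org.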

StanHatko avatar Feb 09 '22 15:02 StanHatko

Great convos here, marking as stale for now. Please create another issue if needed.

Souheil-Yazji avatar Jun 10 '24 17:06 Souheil-Yazji