iranlowo
Improve BIG file dependencies
We need to fix the big file dependencies in this project:

- The pre-trained ADR model (binary) is an 88MB file living in the model folder. This makes for a very heavy upload/download from PyPI.
- The `torch` dependency in `requirements.txt` by default pulls down the GPU version of torch. This makes integration with Heroku and RTD difficult/impossible because of hard size limits. It would be better to integrate and use a CPU-only version. Is this compatible with Travis CI and `requirements.txt`? (See the sketch below.)
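One way to answer the CPU-only question, assuming we pin a torch version that publishes `+cpu` wheels (the version number below is illustrative): `requirements.txt` accepts a `-f`/`--find-links` line, which plain `pip install -r` (and therefore Travis CI) understands.

```
# requirements.txt -- sketch: CPU-only torch pin (version number is illustrative)
# The -f line adds PyTorch's wheel index, where the +cpu builds are hosted.
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.1.0+cpu
```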
To facilitate all this:

- All the ADR pre-trained models live in this Bintray artifactory.
- Is there some clever way (or post-install script) by which we can download them locally as needed? (A sketch follows this list.)
- The upside is that the iranlowo download is fast/small, and then you can separately pull down the models to do inference/prediction.
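A minimal sketch of the download-as-needed idea; the Bintray URL, file name, and cache directory below are hypothetical placeholders:

```python
import os
import urllib.request

# Hypothetical Bintray location of the pre-trained ADR model.
MODEL_URL = "https://dl.bintray.com/ORG/REPO/adr_model.pt"  # placeholder
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".iranlowo", "models")

def fetch_model(url=MODEL_URL, cache_dir=CACHE_DIR):
    """Download the model once; reuse the cached copy on later calls."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path
```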
A possible workaround here would be to have a standalone repository for models. So, if a user needs any functionality tied to a model, a check runs to see whether they've cloned/downloaded the model; if not, an error is raised (see the sketch below). This is how I've seen a lot of projects handle this challenge. On the Travis end, we can have it clone that same repository each time a test needs to be run. The major challenge here is having the user do multiple installs.
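A sketch of that guard, assuming a hypothetical `MODEL_DIR` where the cloned model repository is expected to live:

```python
import os

# Hypothetical path where the user clones/downloads the model repository.
MODEL_DIR = os.environ.get("IRANLOWO_MODEL_DIR", "models/")

def require_model(name):
    """Raise a helpful error if the named model file is missing."""
    path = os.path.join(MODEL_DIR, name)
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"Model '{name}' not found under {MODEL_DIR}. "
            "Clone or download the model repository first."
        )
    return path
```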
I'm not very familiar with torch as I use keras more, but why haven't we considered zipping the file yet? Is that going to reduce performance somehow? If not, it would solve the challenge of having to do multiple installs.
Let me tackle matters in order:
- Regarding "standalone repository for models": I've been saving the models here because pre-optimization (April 2019 time frame), the models were 200MB and too big for GitHub. I listed the link in the top post above ☝️

  > all the ADR pre-trained models live in this Bintray artifactory
- Regarding zipping the file: I optimized the size of the pytorch model. See this issue. It basically removed the "intermediate back-propagation information" needed to continue training from a particular model checkpoint. I don't think additional optimization will gain much, but that is another experiment, to see what the exact compression factor is (sketched below).
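For reference, a sketch of both points: saving only the `state_dict` is what drops the back-propagation/optimizer bookkeeping, and a quick gzip pass would tell us the exact compression factor (the file name is illustrative):

```python
import gzip
import os
import shutil

# Saving only the weights, e.g. torch.save(model.state_dict(), "adr_model.pt"),
# is what removed the intermediate back-propagation information.

def gzip_compression_factor(path):
    """Gzip a file on disk and return original_size / compressed_size."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(path) / os.path.getsize(gz_path)

print(gzip_compression_factor("adr_model.pt"))  # illustrative file name
```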
- Finally, back in April when I was trying to get things started, the 200MB model wasn't going to go onto GitHub and I was using the Bintray artifactory, so I thought perhaps I could add a pre-install step to `setup.py`. I asked this question on the repo of the setupmeta project used to ease setup, and the answer is yes: you can use a pre/post-install step to programmatically download from the artifactory. So that is the path I think we need to explore. I can tackle this next week; I think it'll take some experimentation (trial & error) to ensure that things work smoothly. This is the StackOverflow thread with more details/instructions to implement (sketched below).
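A sketch of that post-install step, using setuptools' standard `cmdclass` override; `iranlowo.downloads.fetch_model` is the hypothetical helper sketched earlier in this thread:

```python
# setup.py -- sketch of a post-install model download hook
from setuptools import setup
from setuptools.command.install import install

class PostInstall(install):
    """Run the normal install, then pull down the pre-trained models."""
    def run(self):
        install.run(self)
        # Hypothetical helper from earlier in this thread.
        from iranlowo.downloads import fetch_model
        fetch_model()

setup(
    name="iranlowo",
    cmdclass={"install": PostInstall},
    # ... other metadata ...
)
```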