ExplainaBoard
Unsafe en_core_web_sm downloading in setup.py
Currently setup.py will execute an external command, python -m spacy download en_core_web_sm, to install a spaCy model during setup. This approach has several issues regarding system consistency:
- spaCy models are intentionally not registered on PyPI, and PyPI does not allow packages to depend on requirements hosted outside PyPI.
- The command is just a system command, which may break the system or may not work correctly.
Since there is no recommended way to add spaCy models to install_requires, we need to take one of the following options:
- Download the model programmatically when spacy.load() fails.
- Bundle the model file into this repository.
- Ask users to download appropriate models additionally.
Some comments:
- Download the model programmatically when spacy.load() fails.
  - This is an easy option for novices, but the system is still changed unexpectedly. This option won't work on offline systems.
- Bundle the model file into this repository.
  - This is also easy to use, but the size of the repository or distribution packages grows, and we are responsible for maintaining the bundled models.
- Ask users to download appropriate models additionally.
  - This is desirable for ease of maintenance, but users are required to type an additional command.
Any comments? @pfliu-nlp @neubig
I'm not 100% sure I understand the issue here actually. So I can understand better: what is an example of a situation where the current method causes a problem for users?
Also, as a meta-comment: currently we expect that the majority of users of the SDK in this repo will be developers or power users who want to actually program new analyses.
We expect that most end-users will interact mostly with the web interface, and won't have to install this repo. Because of this, the emphasis on this repo being extremely easy for everyone to install is not quite as high as it may be for some other consumer software. That being said, it shouldn't be hard of course :)
It is generally a strong assumption that every system command works as we implicitly expect. For example:
- A case we could easily hit is that the "python" command is not officially provided by several operating systems; e.g., recent versions of Ubuntu require additional installs.
- Some build systems (e.g., Bazel) hack the system commands to handle tool-specific processing. In this case, the command someone invoked is sometimes not the command we expected to run.
In addition, since the spaCy model is installed separately by the external command, it is outside dependency management at installation time. This means that pip cannot resolve version conflicts when some other tool also requires the spaCy model.
Depending on external resources from PyPI packages is not viable according to their policy (https://pip.pypa.io/en/stable/news/#id666), so we need to avoid such installation if we use PyPI.
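To illustrate the point about the "python" command: if an external command has to be run at all, one way to avoid depending on a python executable being on PATH is to invoke the interpreter that is already running. This is only a minimal sketch; build_download_command and run_download are hypothetical helper names, not part of this repo or of spaCy, and it does nothing about the dependency-management problem.

```python
import subprocess
import sys
from typing import List


def build_download_command(model: str) -> List[str]:
    """Build the spaCy download command for the running interpreter.

    Hypothetical helper for illustration only.
    """
    # sys.executable is the path of the interpreter running this code, so the
    # command does not rely on a "python" executable being available on PATH.
    # Note: sys.executable can be empty or None if Python cannot determine it.
    return [sys.executable, "-m", "spacy", "download", model]


def run_download(model: str) -> None:
    """Run the download and raise CalledProcessError if it fails."""
    subprocess.run(build_download_command(model), check=True)
```

This still modifies the environment outside of pip's dependency resolution, so it only addresses the first bullet above.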
Hey @odashi, thanks for the nice suggestion and the valuable discussion with Graham.
"Currently setup.py will execute an external command python -m spacy download en_core_web_sm to install a spaCy model during setup. This approach has several issues regarding system consistency:"
If this is the case, I prefer the 1st and 3rd options that you mentioned.
Regarding the 1st option:
"Download the model programmatically when spacy.load() fails. This is an easy option for novices, but the system is still changed unexpectedly. This option won't work on offline systems"
Maybe this ("This option won't work in offline systems") isn't a very urgent issue for us currently?
"Ask users to download appropriate models additionally"
This seems to be another common practice we could adopt. Alternatively, we could also let users download the model via:
pip install https://huggingface.co/spacy/en_core_web_sm/resolve/main/en_core_web_sm-any-py3-none-any.whl
Comparing the 1st and 3rd options, maybe the 1st one would be better?
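For reference, the 3rd option (asking users to install the model themselves) could be paired with an explicit check that fails with an actionable message. The sketch below is an assumption about how that might look, not existing code in this repo: require_model is a hypothetical name, and it assumes the model is installed as a regular Python package (which is how python -m spacy download and the pip command above install it).

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"


def require_model(name: str = MODEL_NAME) -> None:
    """Raise a helpful error if the spaCy model package is not installed.

    Hypothetical helper for illustration only.
    """
    # Models installed via pip are importable packages, so find_spec can
    # detect them without importing spaCy or loading the model itself.
    if importlib.util.find_spec(name) is None:
        raise RuntimeError(
            f"spaCy model '{name}' is not installed. "
            f"Install it with: python -m spacy download {name}"
        )
```

Nothing is downloaded here, so the behavior is the same on offline systems; the cost is the extra command users have to type.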
"Maybe this ('This option won't work in offline systems') isn't a very urgent issue for us currently?"
Backend services on clusters sometimes hit this situation if they don't have outbound access.
I found this line in the repo that may do the same thing: https://github.com/neulab/ExplainaBoard/blob/2f3a9fb151e995fde4d72d7acd6fd2531ef6101b/explainaboard/utils/feature_functions/sum_attribute.py#L11 and I guess the service cluster has outbound access anyway. So the answer for now is "no, not very urgent."
Gotcha, makes sense. I guess we can be consistent in using method "1." for now, but revisit this if it becomes an issue. Thanks!
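For concreteness, method "1." could be sketched roughly as follows. This is a sketch under assumptions, not the code currently in the repo: load_model_with_fallback is a hypothetical name, the load and download parameters exist only so the logic can be exercised without spaCy or network access, and it assumes the freshly downloaded package is importable in the same process (a restart may be needed in some environments).

```python
from typing import Any, Callable, Optional


def load_model_with_fallback(
    name: str = "en_core_web_sm",
    load: Optional[Callable[[str], Any]] = None,
    download: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Load a spaCy model, downloading it once if it is missing.

    Hypothetical helper for illustration only.
    """
    if load is None or download is None:
        # Import lazily so that spaCy is only required when the defaults are used.
        import spacy
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download

    try:
        return load(name)
    except OSError:
        # spacy.load raises OSError when the model cannot be found.
        # The download needs network access, so this fails on offline systems.
        download(name)
        return load(name)
```

Calling load_model_with_fallback() with no arguments uses spacy.load and spacy.cli.download; if the download itself fails, the error propagates to the caller instead of being swallowed.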
I found that the current setup script runs the model installation regardless of the command given to setup.py (i.e., commands other than install also install the spaCy model). This seems to be unintended.