spaCy
spaCy copied to clipboard
Add a dry run flag to download command
Description
This is a simple implementation of a --dry-run
flag for spacy download
that makes it print the URL of the package to be installed without downloading anything.
I don't have strong opinions about the way this is done, but there are some questions:
- would it be better to print just the URL?
- should there be a method to get the URL for outside use?
- should we be concerned about the flags overlapping with pip (they don't currently)?
I'm not sure how universal this is, but I used -n
for the short flag, which I understand to come from noop
.
Types of change
minor enhancement
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
I think there's too much danger for this to conflict with the pip flags. I noticed that this was just merged, maybe we can use this somehow instead? https://github.com/pypa/pip/pull/10771
Or alternatively this should be in a separate command (not download
) and without any interaction with pip flags.
There's no way for pip to resolve things without downloading so pip install --dry-run
does the download but then doesn't install in the end, which isn't great here.
OK, we definitely shouldn't use this if pip is adding a flag with the same name. Since pip has a lot of options it seems like the only safe option is adding another command. I'm not sure what it should be called, maybe download-url
?
We could also provide more information and call it something like pipeline-info
.
I think a separate command sounds easier, but the naming is hard. download_url
still sounds too much to me like it might be downloading something.
In general I think people would like to be able to do something like:
spacy whatever en_core_web_sm >> requirements.txt
What if we had commands like spacy pipelines url
and spacy pipelines list
or something? list
giving the full list and url
giving the url of the listed (maybe one or more?) pipelines?
Let's see, I'd never noticed this but we already have spacy info model
, which is only intended for installed models, but maybe we can add a bit of info about how to specify requirements, also for our own trained pipelines that aren't installed. It would require some level of failing gracefully without internet access.
I like this idea - could it be
spacy info en_core_web_sm --url
that would basically work regardless of whether the model is already installed or not?
I was imagining something that could show a bit more about how to specify it as a dependency, so at least some option with more details like:
$ spacy info en_core_web_sm --requirements
setup.cfg:
install_requires =
en_core_web_sm @ URL
poetry:
something with {url = "URL"} (I'd have to look it up)
Also a plain --url
seems useful. Or maybe this should just point to the docs with more details, but something that makes this easier in general.
OK, I have updated this to support this syntax:
spacy info en_core_web_sm --url
Initially I had this work like other info commands and print a table by default, but consdering the comment about piping the output upthread, and that there's only one relevant field, I instead modified to print just the URL of the most recent compatible version of the pipeline when successful.
As far as failure modes:
- if not online, the compatibility check will fail gracefully with a clear message (this is just in the compatibility function)
- if the pipeline name doesn't exist, an error mentioning that will be raised (same as downloading a name that doesn't exist)
- if you didn't pass a pipeline name there's an error for that
There are some unresolved questions. There are lots of things we could add but I think it's better to keep this a simple, minimal feature that can be integrated in other things, so I've kept things simple when in doubt about the below. (In particular, because this is overloading the existing info
command, the flag interactions would get weird. If we want more sophisticated options I think it'd be better to have a new command or mode.)
- this always provides the wheel URL rather than the
.tar.gz
URL. I don't think that's a problem but I'm not sure when one would be preferable over the other. - I didn't add a feature to directly specify the version rather than grabbing the latest compatible one.
- we could also provide more info about models that aren't installed, but that feels like a different feature.
For the --requirements
option, it wouldn't be difficult to add that, I'll look at doing it tomorrow.
I think providing just the wheel URL is fine and just providing the same URL as you would get from spacy download
is also fine.
The no-internet error is pretty ugly here, especially since it's not that clear to users that it's going to need internet access to provide the URL.
Actually if the model is installed locally, it might make more sense to provide that version, or possibly to provide both?
The no-internet error is pretty ugly here, especially since it's not that clear to users that it's going to need internet access to provide the URL.
I don't think it's that bad? It clearly states what the failure was (couldn't look up the compatibility table). It would be nice if we could check compatibility offline, but since we can't... The ideal solution would probably be a URL that could be generated locally that would redirect to the latest compatible version. (There is also the issue that looking up the compatibility table offline doesn't throw an exception, it exits the program, so it would need to be modified to handle it elegantly.)
Actually if the model is installed locally, it might make more sense to provide that version, or possibly to provide both?
I'm not sure I understand the use case? These are the uses I was imagining:
- Download a model in CI for some reason. You probably want the most recent compatible version, and if you don't you can cache the output somehow.
- You need the URL to download the model in a particular way due to a corporate firewall. So you just need the default download URL.
I guess for locally installed models, you might want the URL if you want to declare dependencies for a project you've been working on - is that what you had in mind? For that use case, would it be better to add the download URL to the spacy info model_name
(without the --url option)?
I just pushed a change to add output for spacy info [pipeline] --requirements
. The code is a bit messy because there are two output formats here.
This is the one I prefer:
============== Requirements info for pipeline 'en_core_web_sm' ==============
Release info page:
https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0
setup.cfg:
install_requires =
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
poetry:
en_core_web_sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl" }
This is a little hacky though - it's using the same msg
output as other pre-existing commands, but with an empty key and one value in order to preserve formatting. (This is probably why the markdown option uses print
.)
This is output with one key per entry. This is the same way other commands work but I think it's too hard to read, mainly because of the newline for setup.cfg
format.
============== Requirements info for pipeline 'en_core_web_sm' ==============
Release info page https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0
setup.cfg install_requires =
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
poetry en_core_web_sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl" }
I also feel like at this point spacy info
is getting a little overloaded, and it would be better to have subcommands (like debug data
). But since we already support model names I guess we'd need to have a new keyword if we wanted to do that.
For this PR I think we should leave --requirements
for later and figure out the formatting for the output. My preference is to provide just a URL for ease when scripting, but I think it would also make sense to provide both the download URL and the release info URL (example) in the typical table format used by other existing options.
Just to clarify the state of this a bit, what I think we should do now is:
- leave
--requirements
for a separate PR - output just the URL, without formatting, for use in scripting
About generating different URLs based on installed packages, I'm still not clear what the use case for that is for downloading (it makes sense for --requirements
).
I have left the --requirements
related suggestions in the PR to make it easier to check them, but would delete them before taking them out of draft if we do want to do that separately.
I think this is OK to merge now?
I'll take another look. Can you update the PR description?
Updated the PR description to reflect the current behavior.
I think the functionality is looking good. Do we want to include the changes to the docs in this PR or a follow-up PR?
There are at least these sections:
https://spacy.io/usage/models#download-pip https://spacy.io/usage/models#production
And maybe easier instructions on the models pages themselves? (I think having to dig through releases for the asset URL is the too-difficult part.)
I'd prefer having the docs changes with the PR!
OK, I updated the docs. In addition to adding mention of the new command to the places linked above, I put download links directly in the tables on the model pages. How's that?
Here's what it looks like on the model page: