spaCy icon indicating copy to clipboard operation
spaCy copied to clipboard

Add a dry run flag to download command

Open polm opened this issue 1 year ago • 8 comments

Description

This is a simple implementation of a --dry-run flag for spacy download that makes it print the URL of the package to be installed without downloading anything.

I don't have strong opinions about the way this is done, but there are some questions:

  • would it be better to print just the URL?
  • should there be a method to get the URL for outside use?
  • should we be concerned about the flags overlapping with pip (they don't currently)?

I'm not sure how universal this is, but I used -n for the short flag, which I understand to come from noop.

Types of change

minor enhancement

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [ ] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

polm avatar Jul 21 '22 11:07 polm

I think there's too much danger for this to conflict with the pip flags. I noticed that this was just merged, maybe we can use this somehow instead? https://github.com/pypa/pip/pull/10771

Or alternatively this should be in a separate command (not download) and without any interaction with pip flags.

adrianeboyd avatar Jul 21 '22 12:07 adrianeboyd

There's no way for pip to resolve things without downloading so pip install --dry-run does the download but then doesn't install in the end, which isn't great here.

adrianeboyd avatar Jul 21 '22 12:07 adrianeboyd

OK, we definitely shouldn't use this if pip is adding a flag with the same name. Since pip has a lot of options it seems like the only safe option is adding another command. I'm not sure what it should be called, maybe download-url?

We could also provide more information and call it something like pipeline-info.

polm avatar Jul 24 '22 07:07 polm

I think a separate command sounds easier, but the naming is hard. download_url still sounds too much to me like it might be downloading something.

In general I think people would like to be able to do something like:

spacy whatever en_core_web_sm >> requirements.txt

adrianeboyd avatar Jul 25 '22 06:07 adrianeboyd

What if we had commands like spacy pipelines url and spacy pipelines list or something? list giving the full list and url giving the url of the listed (maybe one or more?) pipelines?

polm avatar Jul 25 '22 06:07 polm

Let's see, I'd never noticed this but we already have spacy info model, which is only intended for installed models, but maybe we can add a bit of info about how to specify requirements, also for our own trained pipelines that aren't installed. It would require some level of failing gracefully without internet access.

adrianeboyd avatar Aug 03 '22 08:08 adrianeboyd

I like this idea - could it be

spacy info en_core_web_sm --url

that would basically work regardless of whether the model is already installed or not?

svlandeg avatar Aug 09 '22 13:08 svlandeg

I was imagining something that could show a bit more about how to specify it as a dependency, so at least some option with more details like:

$ spacy info en_core_web_sm --requirements

setup.cfg:

install_requires = 
    en_core_web_sm @ URL

poetry:

   something with {url = "URL"} (I'd have to look it up)

Also a plain --url seems useful. Or maybe this should just point to the docs with more details, but something that makes this easier in general.

adrianeboyd avatar Aug 09 '22 13:08 adrianeboyd

OK, I have updated this to support this syntax:

spacy info en_core_web_sm --url

Initially I had this work like other info commands and print a table by default, but consdering the comment about piping the output upthread, and that there's only one relevant field, I instead modified to print just the URL of the most recent compatible version of the pipeline when successful.

As far as failure modes:

  • if not online, the compatibility check will fail gracefully with a clear message (this is just in the compatibility function)
  • if the pipeline name doesn't exist, an error mentioning that will be raised (same as downloading a name that doesn't exist)
  • if you didn't pass a pipeline name there's an error for that

There are some unresolved questions. There are lots of things we could add but I think it's better to keep this a simple, minimal feature that can be integrated in other things, so I've kept things simple when in doubt about the below. (In particular, because this is overloading the existing info command, the flag interactions would get weird. If we want more sophisticated options I think it'd be better to have a new command or mode.)

  • this always provides the wheel URL rather than the .tar.gz URL. I don't think that's a problem but I'm not sure when one would be preferable over the other.
  • I didn't add a feature to directly specify the version rather than grabbing the latest compatible one.
  • we could also provide more info about models that aren't installed, but that feels like a different feature.

For the --requirements option, it wouldn't be difficult to add that, I'll look at doing it tomorrow.

polm avatar Aug 16 '22 10:08 polm

I think providing just the wheel URL is fine and just providing the same URL as you would get from spacy download is also fine.

The no-internet error is pretty ugly here, especially since it's not that clear to users that it's going to need internet access to provide the URL.

adrianeboyd avatar Aug 16 '22 17:08 adrianeboyd

Actually if the model is installed locally, it might make more sense to provide that version, or possibly to provide both?

adrianeboyd avatar Aug 16 '22 18:08 adrianeboyd

The no-internet error is pretty ugly here, especially since it's not that clear to users that it's going to need internet access to provide the URL.

I don't think it's that bad? It clearly states what the failure was (couldn't look up the compatibility table). It would be nice if we could check compatibility offline, but since we can't... The ideal solution would probably be a URL that could be generated locally that would redirect to the latest compatible version. (There is also the issue that looking up the compatibility table offline doesn't throw an exception, it exits the program, so it would need to be modified to handle it elegantly.)

Actually if the model is installed locally, it might make more sense to provide that version, or possibly to provide both?

I'm not sure I understand the use case? These are the uses I was imagining:

  1. Download a model in CI for some reason. You probably want the most recent compatible version, and if you don't you can cache the output somehow.
  2. You need the URL to download the model in a particular way due to a corporate firewall. So you just need the default download URL.

I guess for locally installed models, you might want the URL if you want to declare dependencies for a project you've been working on - is that what you had in mind? For that use case, would it be better to add the download URL to the spacy info model_name (without the --url option)?

polm avatar Aug 17 '22 06:08 polm

I just pushed a change to add output for spacy info [pipeline] --requirements. The code is a bit messy because there are two output formats here.

This is the one I prefer:

============== Requirements info for pipeline 'en_core_web_sm' ==============



    Release info page:

    https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0

    setup.cfg:

    install_requires =
        en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl

    poetry:

    en_core_web_sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl" }

This is a little hacky though - it's using the same msg output as other pre-existing commands, but with an empty key and one value in order to preserve formatting. (This is probably why the markdown option uses print.)

This is output with one key per entry. This is the same way other commands work but I think it's too hard to read, mainly because of the newline for setup.cfg format.

============== Requirements info for pipeline 'en_core_web_sm' ==============

Release info page   https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0
setup.cfg           install_requires =
        en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
poetry              en_core_web_sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl" }

I also feel like at this point spacy info is getting a little overloaded, and it would be better to have subcommands (like debug data). But since we already support model names I guess we'd need to have a new keyword if we wanted to do that.

For this PR I think we should leave --requirements for later and figure out the formatting for the output. My preference is to provide just a URL for ease when scripting, but I think it would also make sense to provide both the download URL and the release info URL (example) in the typical table format used by other existing options.

polm avatar Aug 18 '22 11:08 polm

Just to clarify the state of this a bit, what I think we should do now is:

  1. leave --requirements for a separate PR
  2. output just the URL, without formatting, for use in scripting

About generating different URLs based on installed packages, I'm still not clear what the use case for that is for downloading (it makes sense for --requirements).

I have left the --requirements related suggestions in the PR to make it easier to check them, but would delete them before taking them out of draft if we do want to do that separately.

polm avatar Aug 23 '22 06:08 polm

I think this is OK to merge now?

polm avatar Aug 24 '22 05:08 polm

I'll take another look. Can you update the PR description?

adrianeboyd avatar Aug 24 '22 05:08 adrianeboyd

Updated the PR description to reflect the current behavior.

polm avatar Aug 24 '22 06:08 polm

I think the functionality is looking good. Do we want to include the changes to the docs in this PR or a follow-up PR?

There are at least these sections:

https://spacy.io/usage/models#download-pip https://spacy.io/usage/models#production

And maybe easier instructions on the models pages themselves? (I think having to dig through releases for the asset URL is the too-difficult part.)

adrianeboyd avatar Aug 29 '22 11:08 adrianeboyd

I'd prefer having the docs changes with the PR!

svlandeg avatar Aug 29 '22 12:08 svlandeg

OK, I updated the docs. In addition to adding mention of the new command to the places linked above, I put download links directly in the tables on the model pages. How's that?

polm avatar Aug 30 '22 09:08 polm

Here's what it looks like on the model page:

2022-08-30T18-26-35

polm avatar Aug 30 '22 09:08 polm