Include support for spaCy v3

Open rjurney opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe.

I want to use spaCy v3 to get transformer models. I want to use Snorkel to do text extraction by giving it examples created in spaCy's annotation format. This would be just like the spouse demo, but why do in TensorFlow what spaCy does out of the box?

Describe the solution you'd like

Bump the spaCy entry in requirements.txt to 3.0, test the system, then include support in the next release.
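As a rough illustration of the bump itself, here is a minimal sketch that rewrites only the spaCy line of a requirements file. The helper name `bump_spacy_pin`, the sample pins, and the `>=3.0.0,<4.0.0` range are assumptions for illustration, not Snorkel's actual requirements.

```python
import re


def bump_spacy_pin(requirements_text: str, new_spec: str = ">=3.0.0,<4.0.0") -> str:
    """Replace the version specifier on the spaCy line; leave other lines alone.

    Hypothetical helper: the default range is an assumption, not a tested pin.
    """
    out = []
    for line in requirements_text.splitlines():
        # Match the "spacy" package only, not e.g. "spacy-transformers".
        if re.match(r"^\s*spacy(?![\w.-])", line, flags=re.IGNORECASE):
            out.append(f"spacy{new_spec}")
        else:
            out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one.
    if requirements_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    sample = "numpy>=1.16\nspacy>=2.1.0,<3.0.0\n"  # made-up sample contents
    print(bump_spacy_pin(sample))
```

The pin change is the easy part; the test suite still has to pass against spaCy 3.x afterwards.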

Describe alternatives you've considered

It's probably possible to generate the v3 annotation format in v2, then run spaCy v3 in a separate program.
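A minimal sketch of that two-step idea, under stated assumptions: the first helper builds character-offset annotations with no spaCy dependency (so it could run next to any spaCy version), and the second, meant to run in a separate spaCy v3 environment, serializes them with `DocBin`. The function names, the example sentence, and the `PERSON` label are hypothetical and not taken from Snorkel.

```python
from typing import Dict, List, Tuple

# (text, {"entities": [(start_char, end_char, label), ...]})
Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def to_offset_annotation(text: str, spans: List[Tuple[str, str]]) -> Annotation:
    """Build a character-offset annotation from (substring, label) pairs.

    Uses the first occurrence of each substring in the text.
    """
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        entities.append((start, start + len(substring), label))
    return text, {"entities": entities}


def write_docbin(annotations: List[Annotation], path: str) -> None:
    """Serialize annotations to a .spacy file (assumes spaCy v3 is installed)."""
    # Imported lazily so the helper above works without spaCy.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, annot in annotations:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None when offsets miss token boundaries
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    doc_bin.to_disk(path)
```

The offset tuples are plain Python data, so they can be written to JSON by one program and read back by the other.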

Additional context

spaCy v3 is amazing! https://nightly.spacy.io/

rjurney avatar Dec 31 '20 00:12 rjurney

Hi @rjurney, following up on #1630, we marked this as help wanted. If you want to contribute a PR to support spaCy 3.x, let us know and we can discuss the approach and review when the time comes.

henryre avatar Feb 14 '21 20:02 henryre

@henryre Sounds reasonable. I'm interested in doing this work. I'm looking at the spaCy v3 migration docs and I have a couple of questions:

  • Does Snorkel use any custom pipeline components or factories?
  • Is there a spaCy config file?
  • Does it use the standard tokenizer or is it custom? With standard or modified settings?
  • Does it use tag maps or morph rules?

If you aren't sure, I can figure out the answers myself, but it doesn't look like a difficult migration based on the spaCy code I've read in Snorkel in the past.

rjurney avatar Feb 15 '21 03:02 rjurney

@rjurney agreed, shouldn't be too difficult. The library isn't very opinionated about spaCy usage, so I don't expect any of the above to come into play. The spaCy-based wrappers are primarily contained in the following:

  • https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/lf/nlp.py
  • https://github.com/snorkel-team/snorkel/blob/master/snorkel/preprocess/nlp.py
  • https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/lf/nlp_spark.py
  • https://github.com/snorkel-team/snorkel/blob/master/snorkel/slicing/sf/nlp.py

henryre avatar Feb 15 '21 23:02 henryre

@yinxiangshi I got tox -e complex to run. I am looking over the relevant files to see if there is anything we missed. I didn't quite get your comments about config. I am not sure how that changes things, unless we want to add spaCy config support to Snorkel. I suppose that is reasonable, let me look!

rjurney avatar May 02 '22 18:05 rjurney

@yinxiangshi I found this, and am digging in... https://spacy.io/usage/v3#features-training

rjurney avatar May 02 '22 18:05 rjurney