snorkel
Include support for spaCy v3
Is your feature request related to a problem? Please describe.
I want to use spaCy v3 to get transformer models. I also want to use Snorkel for text extraction by giving it examples created in spaCy's annotation format, just like the spouse demo. Why do in TensorFlow what spaCy does out of the box?
Describe the solution you'd like
Bump the spaCy entry in requirements.txt to 3.0, test the system, then include support in the next release.
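While a bump like this is being tested, it can help to confirm which spaCy major version is actually installed. The following is only a minimal sketch using the standard library; the helper names `major_version`, `installed_spacy_version`, and `supports_v3` are hypothetical and are not part of Snorkel or spaCy.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '3.0.0rc2'."""
    return int(version.split(".")[0])


def installed_spacy_version() -> Optional[str]:
    """Return the installed spaCy version string, or None if spaCy is absent."""
    try:
        return metadata.version("spacy")
    except metadata.PackageNotFoundError:
        return None


def supports_v3(version: Optional[str]) -> bool:
    """True when the given version string denotes spaCy 3.x or newer."""
    return version is not None and major_version(version) >= 3


if __name__ == "__main__":
    found = installed_spacy_version()
    print(f"spaCy version: {found}, v3-compatible: {supports_v3(found)}")
```

A check of this kind could gate v3-specific code paths during a transition period, though simply pinning the requirement may be enough.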
Describe alternatives you've considered
It's probably possible to generate the v3 annotation format in v2 then run spaCy v3 in a separate program.
Additional context
spaCy v3 is amazing! https://nightly.spacy.io/
Hi @rjurney, following up on #1630, we marked this as help wanted. If you want to contribute a PR to support spaCy 3.x, let us know and we can discuss the approach and review when the time comes.
@henryre Sounds reasonable. I'm interested in doing this work. I'm looking at the spaCy v3 migration docs and I have a couple of questions:
- Does Snorkel use any custom pipeline components or factories?
- Is there a spaCy config file?
- Does it use the standard tokenizer or is it custom? With standard or modified settings?
- Does it use tag maps or morph rules?
If you aren't sure, I can figure the answers out myself, but it doesn't look like a difficult migration based on the spaCy code I've read in Snorkel in the past.
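One rough way to answer the questions above is to scan the source for identifiers tied to each migration concern. This is a sketch under assumptions: the search strings are illustrative picks (names such as `create_pipe`, `add_pipe`, `tag_map`, and `morph_rules` appear in spaCy's API, but the list is not exhaustive), and `audit_source` is a hypothetical helper, not an existing tool.

```python
import re
from typing import Dict, List

# Illustrative search strings (assumptions chosen for this sketch), keyed by
# the migration question each one relates to.
PATTERNS: Dict[str, str] = {
    "custom components or factories": r"\bcreate_pipe\b|\bLanguage\.factories\b|\badd_pipe\b",
    "custom tokenizer": r"\bTokenizer\(|\.tokenizer\s*=",
    "tag maps or morph rules": r"\btag_map\b|\bmorph_rules\b",
}


def audit_source(source: str) -> Dict[str, List[int]]:
    """Map each question to the 1-based line numbers that match its pattern."""
    hits: Dict[str, List[int]] = {question: [] for question in PATTERNS}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for question, pattern in PATTERNS.items():
            if re.search(pattern, line):
                hits[question].append(lineno)
    return hits


# Usage on a small, made-up snippet of spaCy-using code.
sample = (
    "import spacy\n"
    "nlp = spacy.load('en_core_web_sm')\n"
    "nlp.add_pipe(my_component)\n"
    "nlp.tokenizer = custom\n"
)
print(audit_source(sample))
```

A text scan like this only flags candidate lines; each hit still needs to be read in context against the migration docs.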
@rjurney agreed, shouldn't be too difficult. The library isn't very opinionated about spaCy usage, so I don't expect any of the above to come into play. The spaCy-based wrappers are primarily contained in the following:
- https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/lf/nlp.py
- https://github.com/snorkel-team/snorkel/blob/master/snorkel/preprocess/nlp.py
- https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/lf/nlp_spark.py
- https://github.com/snorkel-team/snorkel/blob/master/snorkel/slicing/sf/nlp.py
@yinxiangshi I got `tox -e complex` to run. I am looking over the relevant files to see if there is anything we missed. I didn't quite get your comments about config. I am not sure how that changes things, unless we want to add spaCy config support to Snorkel. I suppose that is reasonable, let me look!
@yinxiangshi I found this, and am digging in... https://spacy.io/usage/v3#features-training
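For context on the config question, spaCy v3 describes a pipeline in a `config.cfg` file with sections such as `[nlp]` and `[components.<name>]`. The sketch below reads a hypothetical, heavily trimmed config using only the standard library to show the file's shape. spaCy loads these files through its own config system, so this is an illustration rather than how spaCy or Snorkel would do it, and `read_pipeline` is a made-up helper.

```python
import configparser
import json

# A hypothetical, heavily trimmed config in the style of spaCy v3's config.cfg.
SAMPLE_CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"
"""


def read_pipeline(config_text: str) -> dict:
    """Return the language code and the factory named by each pipeline component."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Values are written as JSON literals, so decode them with json.loads.
    names = json.loads(parser["nlp"]["pipeline"])
    return {
        "lang": json.loads(parser["nlp"]["lang"]),
        "components": {
            name: json.loads(parser[f"components.{name}"]["factory"])
            for name in names
        },
    }


print(read_pipeline(SAMPLE_CONFIG))
```

If Snorkel were to accept a spaCy config, a summary like this could help decide which components its wrappers need to keep enabled; whether that support is worth adding is still an open question in this thread.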