Add coref project
This adds a training config for the new coref component using OntoNotes / CoNLL 2012 data.
The current config works, but is relatively brittle with respect to the OntoNotes setup. It also only trains the word-level coref component, since the span resolver is not quite ready.
Hyperparameters are decent but have not been checked extensively.
Some things that should be done before this is merged:
- [x] better downloading for CoNLL 2012 data (the default download fails due to unusual server settings)
- [x] verify python2 installation and OntoNotes paths in code
- [x] separate config with span predictor
- ~~(maybe) support for gold span2head conversion like wl-coref (requires Java)~~ (this isn't needed)
Also, I put this in "experimental", but maybe it belongs somewhere else - I wasn't very sure about that.
OK, I believe the issues with this project file specifically have been sorted out.
The issues with the old coref scripts have been resolved by putting them in their own repo. Some of the path-related issues were not inherent and were just a config issue.
I'm going to leave this in draft until the span predictor is wrapped up, but if you can install the `feature/coref` branch of spaCy, then this should work.
At this point, building a full pipeline with the `feature/coref` branch of spaCy works, and should be possible for anyone with OntoNotes.
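For anyone trying it, using the resulting pipeline looks roughly like this (a sketch: the `training/coref` output path and the `coref_clusters` span-group prefix follow this project's / the experimental component's conventions, so adjust to your setup):

```python
# Minimal sketch: load the trained pipeline and print coref clusters.
# Assumes the pipeline was saved to training/coref and that clusters
# are stored in doc.spans under keys like "coref_clusters_1".
import spacy

nlp = spacy.load("training/coref")
doc = nlp("John called Sarah because he needed the report.")
for key, cluster in doc.spans.items():
    if key.startswith("coref_clusters"):
        print(key, [span.text for span in cluster])
```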
This is now based on https://github.com/explosion/spacy-experimental/pull/17 instead of the previous coref PR.
Test failures were due to speed tests on other components; I expect they will be resolved by merge.
To clarify how to run this PR: install vanilla spaCy (from PyPI or a local dev env / master) and the branch in this spacy-experimental PR. After that, edit the config to include the path to your local copy of OntoNotes. Everything else should be handled by the project file.
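Before running anything, a quick environment sanity check can save time (a sketch; the `experimental_coref` factory name is an assumption based on spacy-experimental's naming, not something the project file requires):

```python
# Sketch: verify spaCy and spacy-experimental are installed, then try
# building the coref component to confirm the factory is registered.
import importlib.metadata

import spacy
import spacy_experimental  # noqa: F401

print("spacy", importlib.metadata.version("spacy"))
print("spacy-experimental", importlib.metadata.version("spacy-experimental"))

nlp = spacy.blank("en")
nlp.add_pipe("experimental_coref")  # raises if the factory isn't registered
print("coref factory OK")
```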
I tried this out starting in a clean venv. This project will need a `requirements.txt` that includes the right version of `spacy-experimental` (you could just initially point to the right `git+https` or archive URL?) and also needs to specify that you need `en_core_web_sm` somehow.
The first error I ran into:
```
================================= preprocess =================================
Running command: /tmp/venv38-1/bin/python3.8 scripts/preprocess.py assets/train.gold.conll corpus/train.spacy
Traceback (most recent call last):
  File "scripts/preprocess.py", line 109, in <module>
    read_file(sys.argv[1], sys.argv[2])
  File "scripts/preprocess.py", line 19, in read_file
    nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner", "attribute_ruler", "lemmatizer"])
  File "/tmp/venv38-1/lib/python3.8/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/tmp/venv38-1/lib/python3.8/site-packages/spacy/util.py", line 427, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
```
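One option, if auto-downloading is acceptable in a project script, would be to guard the load (a sketch using `spacy.cli.download`; whether a preprocessing script should install packages on the fly is debatable):

```python
# Sketch: fall back to downloading en_core_web_sm if it isn't installed.
import spacy
from spacy.cli import download

MODEL = "en_core_web_sm"
DISABLE = ["tagger", "ner", "attribute_ruler", "lemmatizer"]

try:
    nlp = spacy.load(MODEL, disable=DISABLE)
except OSError:
    download(MODEL)
    nlp = spacy.load(MODEL, disable=DISABLE)
```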
Also maybe `torch` as a requirement somehow? And `spacy-transformers`.
`preprocess` doesn't seem to have the right deps to skip on re-run?
Thanks for the feedback! I think I have fixed most of what you pointed out.
> Also maybe `torch` as a requirement somehow? And `spacy-transformers`.
I specified `spacy-transformers` as a requirement, and I think that pulls in `torch`, but I'm not sure how to specify the right GPU setup - for example, the GPU version will depend on the user's CUDA version. I could call `nvcc` or something, but maybe it's easier to just say "please install spaCy with GPU support" and have a short script check for it and fail at the start?
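Something like this is what I had in mind for the early check (just a sketch; the CUDA extra in the message is illustrative and varies by setup):

```python
# Sketch: fail fast if no GPU is usable, instead of erroring mid-training.
import sys

import spacy

if not spacy.prefer_gpu():
    sys.exit(
        "No GPU found. Please install spaCy with GPU support for your "
        "CUDA version (e.g. pip install 'spacy[cuda113]') and retry."
    )
print("GPU available, continuing.")
```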
Should the spaCy version go in `spacy_version` in `project.yml` instead?
There's some renaming missing for `span_predictor` in the configs and output directories (`assemble` fails).
If I rename enough for `assemble` to run, then:
```
==================================== eval ====================================
Running command: /tmp/venv38-1/bin/python3.8 scripts/run_eval.py training/coref corpus/test.spacy
Traceback (most recent call last):
  File "scripts/run_eval.py", line 7, in <module>
    from spacy.coref_scorer import Evaluator, get_cluster_info, lea
ModuleNotFoundError: No module named 'spacy.coref_scorer'
```
Thanks for all the feedback! I had been focused on testing the training part, but it looks like there was a lot of cleanup left.
I am doing a test run to confirm everything locally, but I think it should work at this point.
The remaining thing is how to specify a Torch / GPU lib install - I'm not really sure what the right way to do that is.
My test run of training and evaluation with the above changes finished successfully.
Just a warning, but since the components are in experimental now, I'm going to move this from `pipelines` to `experimental`, which will be disruptive if you have a local copy.
The test I just added is failing in CI - the logs aren't clear, but I think it's just an issue of requirements not being installed in the test env. I'll sort it out on Monday.
The previous test failures were due to GPU-related issues. Those are resolved, but this is still failing, and I think it's due to disk-space-related issues (like the strange error upthread). I'm still looking into how to resolve it.
After debugging, I'm still not 100% sure what exit code -9 means, but it looks like this might be an issue with memory rather than disk usage. I've changed the tests to use configs with tok2vec instead of Transformers to test that theory.
The test works now, so that should confirm this runs on Windows.
One thing that came up during the test is that this repo uses env vars to set overrides on `training.max_epochs` and `training.max_iterations`. This is a problem when running `spacy project assemble`, since that command doesn't allow overriding those values and will result in an error. The fix in this case was just to overwrite that env var before running the test, but I expect this will come up in other projects eventually.
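For reference, the workaround in the test looks roughly like this (a sketch; I'm assuming the overrides come in via spaCy's `SPACY_CONFIG_OVERRIDES` variable - adjust if this repo sets a different one):

```python
# Sketch: clear the training.* override env var before running the
# assemble workflow, since assemble rejects those overrides.
import os
import subprocess

env = dict(os.environ)
env.pop("SPACY_CONFIG_OVERRIDES", None)
subprocess.run(["spacy", "project", "run", "assemble"], env=env, check=True)
```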
Thanks for getting the test in place! So currently this is running on CPU with tok2vec, right? Is that simply because there's no GPU instance available on the CI?
That's correct, it's running with tok2vec on CPU. That's partly because there's no GPU instance, and partly to keep memory and disk usage down so that the job doesn't die unpredictably.
Going to close and re-open to pull in latest changes in experimental PR...
It looks like the tests are failing on Linux due to memory usage when training the SpanResolver. None of the recent changes should affect that, so maybe it's just flaky, but I'll see if I can reduce the memory usage further.
Since the test went green, I've gone ahead and updated the requirements to use v0.6.0 of `spacy-experimental` instead of the feature branch. The build will break until the release is actually made, but this will avoid another PR later.
Closing and re-opening to test.