Add coref project
This adds a training config for the new coref component using OntoNotes / CoNLL 2012 data.
The current config works, but is relatively brittle with respect to the OntoNotes setup. It also only trains the word-level coref component, since the span resolver is not quite ready.
Hyperparameters are decent but have not been checked extensively.
Some things that should be done before this is merged:
- [x] better downloading for CoNLL 2012 data (the default download fails due to unusual server settings)
- [x] verify python2 installation and OntoNotes paths in code
- [x] separate config with span predictor
- ~~(maybe) support for gold span2head conversion like wl-coref (requires Java)~~ (this isn't needed)
Also, I put this in "experimental", but maybe it belongs somewhere else - I wasn't very sure about that.
OK, I believe the issues with this project file specifically have been sorted out.
The issues with the old coref scripts have been resolved by putting them in their own repo. Some of the path-related issues were not inherent and were just a config issue.
I'm going to leave this in draft until the span predictor is wrapped up, but if you can install the `feature/coref` branch of spaCy, then this should work.
At this point, building a full pipeline with the `feature/coref` branch of spaCy works, and should be possible for anyone with OntoNotes.
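For anyone trying it, using the resulting pipeline looks roughly like this (a sketch: the `training/coref` output path and the `coref_clusters` span-group prefix follow this project's / the experimental component's conventions, so adjust to your setup):

```python
# Minimal sketch: load the trained pipeline and print coref clusters.
# Assumes the pipeline was saved to training/coref and that clusters
# are stored in doc.spans under keys like "coref_clusters_1".
import spacy

nlp = spacy.load("training/coref")
doc = nlp("John called Sarah because he needed the report.")
for key, cluster in doc.spans.items():
    if key.startswith("coref_clusters"):
        print(key, [span.text for span in cluster])
```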
This is now based on https://github.com/explosion/spacy-experimental/pull/17 instead of the previous coref PR.
Test failures were due to speed tests on other components; I expect they will be resolved by merge.
To clarify how to run this PR: install vanilla spaCy (from PyPI or a local dev env / master) and the branch in this spacy-experimental PR. After that, edit the config to include the path to your local copy of OntoNotes. Everything else should be handled by the project file.
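Before running anything, a quick environment sanity check can save time (a sketch; the `experimental_coref` factory name is an assumption based on spacy-experimental's naming, not something the project file requires):

```python
# Sketch: verify spaCy and spacy-experimental are installed, then try
# building the coref component to confirm the factory is registered.
import importlib.metadata

import spacy
import spacy_experimental  # noqa: F401

print("spacy", importlib.metadata.version("spacy"))
print("spacy-experimental", importlib.metadata.version("spacy-experimental"))

nlp = spacy.blank("en")
nlp.add_pipe("experimental_coref")  # raises if the factory isn't registered
print("coref factory OK")
```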
I tried this out starting in a clean venv. This project will need a `requirements.txt` that includes the right version of `spacy-experimental` (you could just initially point to the right `git+https` or archive URL?) and also needs to specify that you need `en_core_web_sm` somehow.
The first error I ran into:
```
================================= preprocess =================================
Running command: /tmp/venv38-1/bin/python3.8 scripts/preprocess.py assets/train.gold.conll corpus/train.spacy
Traceback (most recent call last):
  File "scripts/preprocess.py", line 109, in <module>
    read_file(sys.argv[1], sys.argv[2])
  File "scripts/preprocess.py", line 19, in read_file
    nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner", "attribute_ruler", "lemmatizer"])
  File "/tmp/venv38-1/lib/python3.8/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/tmp/venv38-1/lib/python3.8/site-packages/spacy/util.py", line 427, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
```
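One option, if auto-downloading is acceptable in a project script, would be to guard the load (a sketch using `spacy.cli.download`; whether a preprocessing script should install packages on the fly is debatable):

```python
# Sketch: fall back to downloading en_core_web_sm if it isn't installed.
import spacy
from spacy.cli import download

MODEL = "en_core_web_sm"
DISABLE = ["tagger", "ner", "attribute_ruler", "lemmatizer"]

try:
    nlp = spacy.load(MODEL, disable=DISABLE)
except OSError:
    download(MODEL)
    nlp = spacy.load(MODEL, disable=DISABLE)
```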
Also maybe `torch` as a requirement somehow? And `spacy-transformers`.
`preprocess` doesn't seem to have the right deps to skip on re-run?
Thanks for the feedback! I think I have fixed most of what you pointed out.
> Also maybe `torch` as a requirement somehow? And `spacy-transformers`.
I specified `spacy-transformers` as a requirement, and I think that pulls in `torch`, but I'm not sure how to specify the right GPU setup - for example, the GPU version will depend on the user's CUDA version. I could call `nvcc` or something, but maybe it's easier to just say "please install spaCy with GPU support" and have a short script check for it and fail at the start?
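Something like this is what I had in mind for the early check (just a sketch; the CUDA extra in the message is illustrative and varies by setup):

```python
# Sketch: fail fast if no GPU is usable, instead of erroring mid-training.
import sys

import spacy

if not spacy.prefer_gpu():
    sys.exit(
        "No GPU found. Please install spaCy with GPU support for your "
        "CUDA version (e.g. pip install 'spacy[cuda113]') and retry."
    )
print("GPU available, continuing.")
```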
Should the spaCy version go in `spacy_version` in `project.yml` instead?
There's some renaming missing for `span_predictor` in the configs and output directories (`assemble` fails).
If I rename enough for `assemble` to run, then:
```
==================================== eval ====================================
Running command: /tmp/venv38-1/bin/python3.8 scripts/run_eval.py training/coref corpus/test.spacy
Traceback (most recent call last):
  File "scripts/run_eval.py", line 7, in <module>
    from spacy.coref_scorer import Evaluator, get_cluster_info, lea
ModuleNotFoundError: No module named 'spacy.coref_scorer'
```
Thanks for all the feedback! I had been focused on testing the training part, but it looks like there was a lot of cleanup left.
I am doing a test run to confirm everything locally, but I think it should work at this point.
The remaining thing is how to specify a Torch / GPU lib install - I'm not really sure what the right way to do that is.
My test run of training and evaluation with the above changes finished successfully.
Just a warning, but since the components are in experimental now, I'm going to move this from `pipelines` to `experimental`, which will be disruptive if you have a local copy.
The test I just added is failing in CI - the logs aren't clear, but I think it's just an issue of requirements not being installed in the test env. I'll sort it out on Monday.
The previous test failures were due to GPU-related issues. Those are resolved, but this is still failing, and I think it's due to disk-space-related issues (like the strange error upthread). I'm still looking into how to resolve it.
After debugging, I'm still not 100% sure what exit code -9 means, but it looks like this might be an issue with memory rather than disk usage. I've changed the tests to use configs with tok2vec instead of Transformers to test that theory.
The test works now, so that should confirm this runs on Windows.
One thing that came up during the test is that this repo uses env vars to set overrides on `training.max_epochs` and `training.max_iterations`. This is a problem when running `spacy project assemble`, since that command doesn't allow overriding those values and will result in an error. The fix in this case was just to overwrite that env var before running the test, but I expect this will come up in other projects eventually.
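For reference, the workaround in the test looks roughly like this (a sketch; I'm assuming the overrides come in via spaCy's `SPACY_CONFIG_OVERRIDES` variable - adjust if this repo sets a different one):

```python
# Sketch: clear the training.* override env var before running the
# assemble workflow, since assemble rejects those overrides.
import os
import subprocess

env = dict(os.environ)
env.pop("SPACY_CONFIG_OVERRIDES", None)
subprocess.run(["spacy", "project", "run", "assemble"], env=env, check=True)
```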
Thanks for getting the test in place! So currently this is running on CPU with tok2vec, right? Is that simply because there's no GPU instance available on the CI?
That's correct, it's running with tok2vec on CPU. That's partly because there's no GPU instance, and partly to keep memory and disk usage down so that the job doesn't die unpredictably.
Going to close and re-open to pull in latest changes in experimental PR...
It looks like the tests are failing on Linux due to memory usage when training the SpanResolver. None of the recent changes should affect that, so maybe it's just flaky, but I'll see if I can reduce the memory usage further.
Since the test went green, I've gone ahead and updated the requirements to use v0.6.0 of `spacy-experimental` instead of the feature branch. The build will break until the release is actually made, but this will avoid another PR later.
Closing and re-opening to test.