Compare config.cfg from Prodigy for spaCy 3 with spacy init config

Open pafonta opened this issue 4 years ago • 9 comments

Context

A config.cfg has been created in #274 with the recommended settings (spacy init config).

The new version of Prodigy will automatically create the config.cfg with prodigy data-to-spacy.

At the moment, this new version of Prodigy has not yet been released to everyone.

Goal

The goal of this issue is to compare the config.cfg created in #274 with the one generated by the new Prodigy version.

This might also include checking the performance of the NER model(s) trained with the config.cfg from the new Prodigy.
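A minimal sketch of how such a comparison could be done, assuming both files can be read with Python's standard configparser (spaCy itself loads configs through `spacy.util.load_config`, which may treat some values differently). The helper names `load_cfg` and `diff_cfg` and the two config excerpts are hypothetical; they are not the real files from #274 or from Prodigy.

```python
import configparser


def load_cfg(text: str) -> dict:
    """Parse config text into a flat {"section.key": value} mapping."""
    # interpolation=None keeps values such as ${paths.train} untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    flat = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            flat[f"{section}.{key}"] = value
    return flat


def diff_cfg(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Hypothetical excerpts, used only to illustrate the comparison.
ours = """
[training]
dropout = 0.1
max_steps = 20000

[components.ner]
factory = "ner"
"""
theirs = """
[training]
dropout = 0.2
max_steps = 20000

[training.optimizer]
@optimizers = "Adam.v1"
"""

for key, (before, after) in diff_cfg(load_cfg(ours), load_cfg(theirs)).items():
    print(f"{key}: {before!r} -> {after!r}")
```

A value of `None` on one side means the key exists in only one of the two files. For real files, the text could be read with `pathlib.Path("config.cfg").read_text()` before calling `load_cfg`.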

Reference

https://support.prodi.gy/t/prodigy-nightly-spacy-v3-support-ui-for-overlapping-spans-improved-feeds-more/3861

pafonta avatar Mar 17 '21 14:03 pafonta

Let's try to unblock this by getting access to the nightly program! https://form.typeform.com/to/qgvLcg0K

FrancescoCasalegno avatar Apr 06 '21 13:04 FrancescoCasalegno

Waiting for Prodigy to reply :)

FrancescoCasalegno avatar Apr 06 '21 16:04 FrancescoCasalegno

Running

prodigy data-to-spacy \
    --lang en \
    --ner annotations15_EmmanuelleLogette_2020-09-22_raw9_Pathway \
    --eval-split 0.1 \
    --base-model en_ner_craft_md \
    --optimize accuracy \
    --verbose tmp

produces the following output.

ℹ Using base model 'en_ner_craft_md'

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 134 | Evaluation: 14 (10% split)
Training: 134 | Evaluation: 14
Labels: ner (1)
  - [ner] PATHWAY
/usr/local/lib/python3.7/dist-packages/spacy/training/iob_utils.py:142: UserWarning: [W030] Some entities could not be aligned in the text "Electrochemical potential-driven transporters (cla..." with entities "[(187, 196, 'PATHWAY'), (331, 344, 'PATHWAY'), (61...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
✔ Saved 134 training examples
tmp/train.spacy
✔ Saved 14 evaluation examples
tmp/dev.spacy

============================= Generating config =============================
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/casalegn/.local/lib/python3.7/site-packages/prodigy/__main__.py", line 54, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 505, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/casalegn/.local/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/casalegn/.local/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/casalegn/.local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 435, in data_to_spacy
    nlp = spacy_init_nlp(config, use_gpu=0 if gpu else -1)  # ID doesn't matter
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/initialize.py", line 57, in init_nlp
    train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
  File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 474, in resolve_dot_names
    result = registry.resolve(config[section])
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 723, in resolve
    config, schema=schema, overrides=overrides, validate=validate, resolve=True
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 772, in _make
    config, schema, validate=validate, overrides=overrides, resolve=resolve
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 825, in _fill
    promise_schema = cls.make_promise_schema(value, resolve=resolve)
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 1016, in make_promise_schema
    func = cls.get(reg_name, func_name)
  File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 141, in get
    ) from None
catalogue.RegistryError: [E893] Could not find function 'specialized_ner_reader' in function registry 'readers'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: prodigy.MergedCorpus.v1, prodigy.NERCorpus.v1, prodigy.ParserCorpus.v1, prodigy.TaggerCorpus.v1, prodigy.TextCatCorpus.v1, spacy.Corpus.v1, spacy.JsonlCorpus.v1, spacy.read_labels.v1, srsly.read_json.v1, srsly.read_jsonl.v1, srsly.read_msgpack.v1, srsly.read_yaml.v1
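On the W030 warning in the output above: an entity is "misaligned" when its character offsets do not start and end on token boundaries. The sketch below illustrates the idea with a rough regex tokenizer, which is an assumption and not spaCy's tokenizer; the authoritative check remains `spacy.training.offsets_to_biluo_tags`, as the warning says. The function name `misaligned_entities` and the entity offsets are hypothetical.

```python
import re

# Rough stand-in tokenizer: runs of word characters or single punctuation marks.
# spaCy's real tokenizer has many more rules, so results can differ.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def misaligned_entities(text, entities):
    """Return the (start, end, label) spans that do not sit on token boundaries."""
    starts = set()
    ends = set()
    for match in TOKEN_RE.finditer(text):
        starts.add(match.start())
        ends.add(match.end())
    return [
        (start, end, label)
        for start, end, label in entities
        if start not in starts or end not in ends
    ]


text = "Electrochemical potential-driven transporters"
entities = [
    (0, 15, "PATHWAY"),   # "Electrochemical": starts and ends on a boundary
    (16, 30, "PATHWAY"),  # "potential-driv": ends inside the token "driven"
]
print(misaligned_entities(text, entities))
```

Spans reported this way are the kind that spaCy marks with '-' and ignores during training.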

FrancescoCasalegno avatar Apr 07 '21 13:04 FrancescoCasalegno

See https://support.prodi.gy/t/could-not-find-function-specialized-ner-reader-in-function-registry-readers/4109

FrancescoCasalegno avatar Apr 07 '21 13:04 FrancescoCasalegno

The AllenAI people forgot to use --code with spacy package, so the custom reader code is not shipped with the packaged model.

https://github.com/allenai/scispacy/blob/4ade4ec897fa48c2ecf3187caa08a949920d126d/project.yml#L593
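To illustrate why a missing code file leads to E893, here is a toy registry, assuming the usual pattern where a function only becomes available once the module that registers it has been imported. This is a simplified illustration, not the actual catalogue or spaCy implementation; the `Registry` class and `corpus_reader` function are hypothetical.

```python
class Registry:
    """Toy name-to-function registry (illustrative, not catalogue's code)."""

    def __init__(self, name):
        self.name = name
        self._functions = {}

    def register(self, func_name):
        """Decorator that stores the function under the given name."""
        def decorator(func):
            self._functions[func_name] = func
            return func
        return decorator

    def get(self, func_name):
        """Look up a function, listing the known names if it is missing."""
        if func_name not in self._functions:
            available = ", ".join(sorted(self._functions)) or "(none)"
            raise KeyError(
                f"Could not find function '{func_name}' in registry "
                f"'{self.name}'. Available names: {available}"
            )
        return self._functions[func_name]


readers = Registry("readers")


@readers.register("spacy.Corpus.v1")
def corpus_reader(path):
    return f"reading {path}"


# The module that would register 'specialized_ner_reader' is never imported
# here, so the lookup fails in the same way as in the traceback.
try:
    readers.get("specialized_ner_reader")
except KeyError as error:
    print(error)
```

In this picture, packaging the model together with the file that defines the reader would make the registration run when the package is imported.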

pafonta avatar Apr 07 '21 14:04 pafonta

As our corpora are in a format different from the scispaCy one, we don't actually need their custom reader (specialized_ner_reader). If we can tell prodigy to generate a config.cfg with the reader we use (spacy.Corpus.v1), it would provide a useful workaround.

To save us some back and forth, I have already asked in the thread on the Prodigy forum.
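Whether Prodigy can be told to do this is exactly the open question. As a rough sketch of the manual alternative, the corpus sections of an already generated config could be rewritten to use spacy.Corpus.v1, which takes a `path` setting. The layout of the input config (including the `file_path` key) is an assumption, the function name `use_default_reader` is hypothetical, and the stdlib configparser is used here only as a stand-in for spaCy's own config loader.

```python
import configparser
import io


def use_default_reader(config_text: str) -> str:
    """Point [corpora.train] and [corpora.dev] at spacy.Corpus.v1."""
    # interpolation=None keeps ${...} references as literal text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    for split in ("train", "dev"):
        section = f"corpora.{split}"
        # Drop the old section so reader-specific keys do not linger.
        if parser.has_section(section):
            parser.remove_section(section)
        parser.add_section(section)
        parser.set(section, "@readers", '"spacy.Corpus.v1"')
        parser.set(section, "path", "${paths." + split + "}")
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Assumed shape of the base model's corpus settings (hypothetical excerpt).
base_model_cfg = """
[paths]
train = null
dev = null

[corpora.train]
@readers = "specialized_ner_reader"
file_path = ${paths.train}
"""

print(use_default_reader(base_model_cfg))
```

The paths could then be supplied at training time, for example with `--paths.train tmp/train.spacy --paths.dev tmp/dev.spacy` on the `spacy train` command line.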

pafonta avatar Apr 08 '21 07:04 pafonta

We are still waiting for Prodigy to come up with a proper fix for the issues we hit when trying to call

prodigy data-to-spacy \
    --lang en \
    --ner annotations15_EmmanuelleLogette_2020-09-22_raw9_Pathway \
    --eval-split 0.1 \
    --base-model en_ner_craft_md \
    --optimize accuracy \
    --verbose tmp

But in the meantime, without --base-model, this is the config.cfg file that gets generated: config.cfg

FrancescoCasalegno avatar Apr 09 '21 15:04 FrancescoCasalegno

The Prodigy people commented here that this use case is a bit of a corner case and is currently not supported.

Hopefully they will come back to us with some update once there's a solution for this.

FrancescoCasalegno avatar Apr 13 '21 14:04 FrancescoCasalegno

Hello @FrancescoCasalegno !

What about closing this issue?

The goal of the issue was to see what config.cfg Prodigy was generating for fine-tuning an existing model.

It is no longer relevant, since we now train new models from scratch.

pafonta avatar Jul 19 '21 07:07 pafonta