Compare config.cfg from Prodigy for spaCy 3 with spacy init config
Context
A config.cfg has been created in #274 with the recommended settings (spacy init config).
The new version of Prodigy will automatically create the config.cfg with prodigy data-to-spacy.
At the moment, this new version of Prodigy has not yet been released to everyone.
Goal
The goal of this issue is to check the differences between the config.cfg created in #274 and the one generated by the new Prodigy version.
This may also include checking the performance of the NER model(s) trained with the config.cfg from the new Prodigy.
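One way to do the comparison is a key-by-key diff of the two files. Below is a minimal sketch using only the standard library. The two config strings are made-up placeholders (not the real contents of the #274 config or of the Prodigy output), and the helper names load_flat and diff_configs are hypothetical. It treats each config as plain INI text and does not resolve ${...} variables.

```python
import configparser


def load_flat(cfg_text):
    """Parse config text into a flat {"section.key": raw_value} dict."""
    # interpolation=None keeps spaCy-style ${paths.train} references as raw text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive, e.g. "@readers"
    parser.read_string(cfg_text)
    return {
        f"{section}.{key}": value
        for section in parser.sections()
        for key, value in parser[section].items()
    }


def diff_configs(a_text, b_text):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    a, b = load_flat(a_text), load_flat(b_text)
    changes = {}
    for key in sorted(a.keys() | b.keys()):
        left, right = a.get(key), b.get(key)  # None means "key is missing"
        if left != right:
            changes[key] = (left, right)
    return changes


# Placeholder snippets for illustration only (assumed, not real outputs).
INIT_CONFIG = """
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[training]
dropout = 0.1
"""

PRODIGY_CONFIG = """
[corpora.train]
@readers = "specialized_ner_reader"
path = ${paths.train}

[training]
dropout = 0.1
max_steps = 20000
"""

for key, (left, right) in diff_configs(INIT_CONFIG, PRODIGY_CONFIG).items():
    print(f"{key}: {left} -> {right}")
```

In practice the two strings would be read from the actual config.cfg files, e.g. with pathlib.Path(...).read_text().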
Reference
https://support.prodi.gy/t/prodigy-nightly-spacy-v3-support-ui-for-overlapping-spans-improved-feeds-more/3861
Let's try to unblock this by joining the nightly program! https://form.typeform.com/to/qgvLcg0K
Waiting for Prodigy to reply :)

Running
prodigy data-to-spacy \
--lang en \
--ner annotations15_EmmanuelleLogette_2020-09-22_raw9_Pathway \
--eval-split 0.1 \
--base-model en_ner_craft_md \
--optimize accuracy \
--verbose tmp
produces the following output.
ℹ Using base model 'en_ner_craft_md'
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 134 | Evaluation: 14 (10% split)
Training: 134 | Evaluation: 14
Labels: ner (1)
- [ner] PATHWAY
/usr/local/lib/python3.7/dist-packages/spacy/training/iob_utils.py:142: UserWarning: [W030] Some entities could not be aligned in the text "Electrochemical potential-driven transporters (cla..." with entities "[(187, 196, 'PATHWAY'), (331, 344, 'PATHWAY'), (61...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
✔ Saved 134 training examples
tmp/train.spacy
✔ Saved 14 evaluation examples
tmp/dev.spacy
============================= Generating config =============================
ℹ Using config from base model
✔ Generated training config
======================== Generating cached label data ========================
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/casalegn/.local/lib/python3.7/site-packages/prodigy/__main__.py", line 54, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 505, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/casalegn/.local/lib/python3.7/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/home/casalegn/.local/lib/python3.7/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/casalegn/.local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 435, in data_to_spacy
nlp = spacy_init_nlp(config, use_gpu=0 if gpu else -1) # ID doesn't matter
File "/usr/local/lib/python3.7/dist-packages/spacy/training/initialize.py", line 57, in init_nlp
train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 474, in resolve_dot_names
result = registry.resolve(config[section])
File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 723, in resolve
config, schema=schema, overrides=overrides, validate=validate, resolve=True
File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 772, in _make
config, schema, validate=validate, overrides=overrides, resolve=resolve
File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 825, in _fill
promise_schema = cls.make_promise_schema(value, resolve=resolve)
File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 1016, in make_promise_schema
func = cls.get(reg_name, func_name)
File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 141, in get
) from None
catalogue.RegistryError: [E893] Could not find function 'specialized_ner_reader' in function registry 'readers'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.
Available names: prodigy.MergedCorpus.v1, prodigy.NERCorpus.v1, prodigy.ParserCorpus.v1, prodigy.TaggerCorpus.v1, prodigy.TextCatCorpus.v1, spacy.Corpus.v1, spacy.JsonlCorpus.v1, spacy.read_labels.v1, srsly.read_json.v1, srsly.read_jsonl.v1, srsly.read_msgpack.v1, srsly.read_yaml.v1
See https://support.prodi.gy/t/could-not-find-function-specialized-ner-reader-in-function-registry-readers/4109
It looks like the AllenAI team forgot to use --code with spacy package, so the custom reader is not shipped with the packaged model.
https://github.com/allenai/scispacy/blob/4ade4ec897fa48c2ecf3187caa08a949920d126d/project.yml#L593
As our corpora are in a format different from the scispaCy one, we don't actually need their custom reader (specialized_ner_reader). If we can tell prodigy to generate a config.cfg with the reader we use (spacy.Corpus.v1), it would provide a useful workaround.
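Until Prodigy can do this itself, the swap could in principle be done by patching the generated config.cfg by hand. The sketch below is an untested idea, not something confirmed on the Prodigy thread: it rewrites the [corpora.train] and [corpora.dev] sections to the block that spacy init config writes for spacy.Corpus.v1. The helper name use_default_reader and the GENERATED snippet are hypothetical, and it assumes the config has a [paths] section with train and dev entries.

```python
import configparser
import io

# Settings that `spacy init config` writes for the built-in reader.
DEFAULT_READER = {
    "@readers": '"spacy.Corpus.v1"',
    "max_length": "0",
    "gold_preproc": "false",
    "limit": "0",
    "augmenter": "null",
}


def use_default_reader(cfg_text):
    """Return cfg_text with both corpora sections pointing at spacy.Corpus.v1."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive, e.g. "@readers"
    parser.read_string(cfg_text)
    for split in ("train", "dev"):
        section = f"corpora.{split}"
        # Drop the old section so no arguments of the custom reader survive.
        if parser.has_section(section):
            parser.remove_section(section)
        parser.add_section(section)
        parser[section]["path"] = "${paths." + split + "}"
        parser[section].update(DEFAULT_READER)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Placeholder input for illustration only (assumed, not the real Prodigy output).
GENERATED = """
[paths]
train = "tmp/train.spacy"
dev = "tmp/dev.spacy"

[corpora.train]
@readers = "specialized_ner_reader"
"""

print(use_default_reader(GENERATED))
```

Note that configparser rewrites the file without comments and may reorder sections, so this is only meant for a quick experiment, not as a replacement for a proper fix in data-to-spacy.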
To save us some back and forth, I have already asked in the thread on the Prodigy forum.
We are still waiting for Prodigy to come up with a proper fix for the issue we hit when calling
prodigy data-to-spacy \
--lang en \
--ner annotations15_EmmanuelleLogette_2020-09-22_raw9_Pathway \
--eval-split 0.1 \
--base-model en_ner_craft_md \
--optimize accuracy \
--verbose tmp
In the meantime, without --base-model, this is the config.cfg file that gets generated: config.cfg
The Prodigy team commented here that this use case is a bit of a corner case and is currently not supported.
Hopefully they will get back to us with an update once there is a solution for this.
Hello @FrancescoCasalegno !
What about closing this issue?
The goal of the issue was to see what config.cfg Prodigy was generating for fine-tuning an existing model.
It is no longer relevant, since we now train new models from scratch.