project to preprocess spancat datasets into .spacy
This project is meant to help with experimenting with the `spancat` and `ner` components. It currently includes:
- CoNLL (English, German, Spanish, Dutch)
- WikiNEuRal (English, German, Spanish, Dutch)
- Dutch Archeology data set
- AnEM
- WNUT 2017
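As a rough illustration of what preprocessing into `.spacy` involves, the sketch below stores character-offset span annotations in `doc.spans` and collects the docs in a `DocBin`. It is not the project's actual conversion code: the helper name `to_docbin`, the toy example, and the span key `"sc"` (spaCy's default for `spancat`) are assumptions made for illustration.

```python
import spacy
from spacy.tokens import DocBin


def to_docbin(nlp, examples, spans_key="sc"):
    """Hypothetical helper: build a DocBin from (text, [(start, end, label)]) pairs."""
    db = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations:
            # char_span returns None if the offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                spans.append(span)
        # spancat reads its annotations from doc.spans[spans_key].
        doc.spans[spans_key] = spans
        db.add(doc)
    return db


nlp = spacy.blank("en")
# Toy data, invented for this sketch.
examples = [("Apple opened an office in Berlin.", [(0, 5, "ORG"), (26, 32, "LOC")])]
db = to_docbin(nlp, examples)
# db.to_disk("train.spacy") would write the file that spaCy's training commands expect.
```

Unlike `doc.ents`, a span group may contain overlapping or nested spans, which is why the same preprocessed data could serve both components only when the annotations are flat.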
For the archeology data set it creates a random train/dev/test split; for all other data sets it uses the standard splits. It has a small utility, `view_spans.py`, that shows a couple of random examples from a data set on the command line, and it runs `debug data` with a default `spancat` config to show the span characteristics.
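A random train/dev/test split like the one mentioned for the archeology data can be sketched as follows. The split fractions, the fixed seed, and the function name `random_split` are assumptions for illustration; the description does not say which proportions the project uses.

```python
import random


def random_split(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Hypothetical helper: shuffle items and cut them into train/dev/test lists."""
    items = list(items)
    # A seeded generator keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


# Stand-in for a list of documents.
train, dev, test = random_split(range(100))
```

Fixing the seed matters for benchmarking: without it, every run would evaluate on a different test set and scores would not be comparable.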
The idea is to have a home for the `spancat`/`ner` data sets so that components can easily be benchmarked on them. Others could perhaps also use these data sets for pre-training.
The project currently lives at https://github.com/explosion/spancat-datasets/, but there is no reason not to make it public, and it's nicer to have it in the projects in any case, as we discussed.