project to preprocess spancat datasets into .spacy

Open kadarakos opened this issue 1 year ago • 1 comments

This is a project that is meant to help with experimenting with the spancat and ner components. Currently it has:

  1. CoNLL (English, German, Spanish, Dutch)
  2. WikiNEuRal (English, German, Spanish, Dutch)
  3. Dutch Archeology data set
  4. AnEM
  5. WNUT 2017

For the archeology data set it creates a random train/dev/test split; for all other data sets it uses the standard splits. It includes a small utility, view_spans.py, that prints a few random examples from a data set on the command line, and it runs debug data with a default spancat config to show the span characteristics.
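A random train/dev/test split of the kind described for the archeology data could look roughly like the sketch below. The function name random_split, the 80/10/10 ratios, and the fixed seed are assumptions for illustration and are not taken from the project itself.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T],
    dev_frac: float = 0.1,   # assumed ratio, not from the project
    test_frac: float = 0.1,  # assumed ratio, not from the project
    seed: int = 0,           # fixed seed so the split is reproducible
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and cut them into train, dev and test lists."""
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(items)
    # Use a local Random instance to avoid touching the global RNG state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    # Placeholder documents; in practice these would be annotated examples.
    docs = [f"doc_{i}" for i in range(100)]
    train, dev, test = random_split(docs)
    print(len(train), len(dev), len(test))
```

Fixing the seed keeps the split stable across runs, which matters when benchmark numbers from different experiments are compared.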

The idea is to have a home for the spancat/ner data sets so that components can be easily benchmarked on them. Others could perhaps also use these data sets for pre-training.

kadarakos avatar Aug 25 '22 15:08 kadarakos

The project currently lives at https://github.com/explosion/spancat-datasets/, but there is no reason not to make it public, and it's nicer if it's in the projects repo in any case, as we discussed.

kadarakos avatar Aug 25 '22 15:08 kadarakos