
Generate MsgPack export/import

Open leonkunert opened this issue 3 years ago • 4 comments

We should try to reimplement the MsgPack-based format that spaCy uses. https://msgpack.org/ should be helpful. Maybe also implement import.

leonkunert avatar Oct 14 '22 09:10 leonkunert

I think the current format that spaCy uses for NER data is DocBin. I don't know if there is an open spec that would allow reading and writing this format. Maybe reading the spaCy code will help.

Either way, I don't see a big need for msgpack.

tecoholic avatar Oct 14 '22 10:10 tecoholic

The DocBin format is gzipped MsgPack: https://spacy.io/api/docbin

leonkunert avatar Oct 14 '22 10:10 leonkunert

@leonkunert Ah.. I should have RTFD. Thanks for pointing that out. Then this is definitely something that should be implemented.

tecoholic avatar Oct 14 '22 10:10 tecoholic

The tokens, spaces, and lengths fields could be difficult. They are serialized NumPy arrays.

leonkunert avatar Oct 14 '22 10:10 leonkunert
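
To make the concern about the array fields concrete, here is a minimal sketch of how such raw buffers could be decoded without NumPy. It rests on assumptions that should be checked against the spaCy source before relying on them: that `lengths` is a flat int32 array, `tokens` is a C-order uint64 array with one row per token and one column per attribute, `spaces` is one byte per token, and that everything is little-endian. The function names are hypothetical and not part of spaCy or ner-annotator, and the MsgPack and decompression layers are left out.

```python
import struct


def decode_lengths(raw: bytes) -> list[int]:
    """Read a flat array of little-endian int32 values (assumed dtype)."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw))


def decode_tokens(raw: bytes, n_attrs: int) -> list[list[int]]:
    """Read a C-order (n_tokens, n_attrs) array of little-endian uint64 (assumed dtype)."""
    count = len(raw) // 8
    flat = struct.unpack(f"<{count}Q", raw)
    return [list(flat[i:i + n_attrs]) for i in range(0, count, n_attrs)]


def decode_spaces(raw: bytes) -> list[bool]:
    """Read one boolean per byte (assumed layout): True if the token is followed by a space."""
    return [b != 0 for b in raw]


def split_by_lengths(rows: list, lengths: list[int]) -> list[list]:
    """Cut a flat per-token list into per-document chunks using the document lengths."""
    docs, start = [], 0
    for n in lengths:
        docs.append(rows[start:start + n])
        start += n
    return docs


# Hand-built example buffers: two documents with 2 and 1 tokens, 2 attributes per token.
raw_lengths = struct.pack("<2i", 2, 1)
raw_tokens = struct.pack("<6Q", 10, 11, 20, 21, 30, 31)
raw_spaces = bytes([1, 0, 0])

lengths = decode_lengths(raw_lengths)
tokens_per_doc = split_by_lengths(decode_tokens(raw_tokens, n_attrs=2), lengths)
spaces_per_doc = split_by_lengths(decode_spaces(raw_spaces), lengths)
```

The same idea should carry over to JavaScript with typed-array views over the decoded buffer, though 64-bit values would need `BigUint64Array` or similar handling there.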