doccano-transformer

Token level output

Open nk-alex opened this issue 2 years ago • 3 comments

Hi, my question is related to this one. Is this feature already supported?

I'm using doccano to annotate my files and exporting them in .jsonl format. As an output I get something like this:

{"id":1,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]}
{"id":2,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]}

My code looks like this:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

# Read the doccano export and write each example in CoNLL-2003 format.
dataset = read_jsonl(filepath='admin.jsonl', dataset=NERDataset, encoding='latin-1')
with open('transformed.txt', 'w', encoding='utf-8') as file:
    for entry in dataset.to_conll2003(tokenizer=str.split):
        file.write(entry['data'] + '\n')

I'm getting this error: KeyError: 'The file should includes either "labels" or "annotations".' What changes do I need to make to the doccano output file to get the desired result?

  • Operating System: Windows 11
  • Python Version Used: 3.10.4
  • doccano-transformer Version: 1.0.2

nk-alex, Aug 01 '22 11:08

Same issue. Have you solved it?

littlestar502, Aug 15 '22 05:08

I haven't. Tell me if you have better luck.

nk-alex, Aug 25 '22 13:08

It seems that, in the JSONL file exported from doccano, the key is 'label' and not 'labels' as expected.

pdbang, Oct 14 '22 13:10

I ran into the same issue. It seems that doccano changed its JSONL export structure and doccano-transformer has not been adapted to it. In the JSONL export for NER datasets there is now an entities key instead of annotations, and the labels no longer carry user (annotator) information. Solution: rename every entities key in your JSONL file to annotations, then edit doccano_transformer\examples.py and change line 29 from labels[annotation['user']].append([ to labels[0].append([ , which assigns all labels to a default user 0.
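The key rename can also be scripted instead of done by hand. The following is only a minimal sketch under some assumptions: the function names convert_record and convert_file are made up for illustration, and adding a "user": 0 field to each annotation is an untested guess at avoiding the edit to examples.py, based on line 29 reading annotation['user']. Other fields such as relations are passed through unchanged.

```python
import json


def convert_record(record: dict, default_user: int = 0) -> dict:
    """Return a copy of one doccano record with "entities" renamed to "annotations".

    Each annotation also gets a "user" field (hypothetical workaround) so that
    code reading annotation['user'] finds a value.
    """
    converted = {key: value for key, value in record.items() if key != "entities"}
    converted["annotations"] = [
        {**entity, "user": entity.get("user", default_user)}
        for entity in record.get("entities", [])
    ]
    return converted


def convert_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Convert a JSONL export line by line, skipping blank lines."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            if line.strip():
                record = convert_record(json.loads(line))
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, convert_file('admin.jsonl', 'admin_converted.jsonl') would write a new file and leave the original export untouched; the output file name here is arbitrary.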

TiffanyBlews, Feb 15 '23 07:02