doccano-transformer
Token level output
Hi, my question is related to this one. Is this feature already supported?
I'm using doccano to annotate my files and exporting them in .jsonl format. As an output I get something like this:
{"id":1,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]} {"id":2,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]}
My code looks like this:
from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open('trensformed.txt', "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath=r'admin.jsonl', dataset=NERDataset, encoding='latin-1').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")
I'm getting this error: KeyError: 'The file should includes either "labels" or "annotations".' What changes do I need to make to the doccano output file to get the desired result?
- Operating System: Windows 11
- Python Version Used: 3.10.4
- doccano-transformer Version: 1.0.2
Same issue, have you solved it?
I haven't. Tell me if you have better luck.
It seems that, in the JSONL file exported from doccano, the keys are 'label' and not 'labels' as expected
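One way to check which keys an export actually contains is to count the top-level keys across its records. This is only a minimal sketch using the standard library; top_level_keys is a hypothetical helper name, and the sample record is modelled on the export shown in the question.

```python
import json
from collections import Counter


def top_level_keys(lines):
    """Count the top-level keys across JSONL records (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts.update(json.loads(line).keys())
    return counts


# Sample record shaped like the export in the question (text elided there too).
sample = ['{"id": 1, "text": "...", "entities": [], "relations": []}']
print(top_level_keys(sample))

# For a real export, pass the open file instead, for example:
# with open("admin.jsonl", encoding="utf-8") as f:
#     print(top_level_keys(f))
```

If neither "labels" nor "annotations" shows up in the counts, the KeyError quoted above is the expected outcome.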
Ran into the same issue. It seems that doccano updated their jsonl output structure, but doccano-transformer has not been adapted to it yet. For NER dataset jsonl, there is now entities instead of annotations, and there is no user (annotator) information for each label anymore.
Solution:
You need to replace every "entities" key in your jsonl file with "annotations", and edit doccano_transformer\examples.py: change line 29 from labels[annotation['user']].append([ to labels[0].append([, which gives all labels a default user 0.
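The key rename described above can also be done with a small script instead of a manual find-and-replace. The sketch below is not part of doccano-transformer: convert_record and convert_file are hypothetical helper names, and the file names in the usage comment are placeholders. It also adds a "user" field with a default of 0 to every annotation. Whether that makes the edit to examples.py unnecessary is an assumption based on the line quoted above, which reads annotation['user'], so check it against your installed version.

```python
import json


def convert_record(record, default_user=0):
    """Return a copy of one doccano record with "entities" renamed to "annotations".

    Each annotation also gets a "user" field (assumption: the library looks up
    annotation['user'], as the line quoted in the workaround suggests).
    """
    converted = dict(record)  # shallow copy, the input record is not modified
    if "entities" in converted and "annotations" not in converted:
        converted["annotations"] = [
            {**entity, "user": entity.get("user", default_user)}
            for entity in converted.pop("entities")
        ]
    return converted


def convert_file(src, dst, encoding="utf-8"):
    """Convert a JSONL export line by line, writing one JSON object per line."""
    with open(src, encoding=encoding) as fin, open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            if line.strip():  # skip blank lines
                record = convert_record(json.loads(line))
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example call (placeholder file names):
# convert_file("admin.jsonl", "admin_converted.jsonl")
```

Writing to a separate output file keeps the original export untouched, so the conversion can be rerun if the expected structure turns out to differ.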