doccano-transformer
Terrible Documentation
Feature description
Please improve the documentation. How is it possible that I cannot find documentation explaining the different classes? The tool can be as good as you like, but I have to read the code directly to understand what features it has or how to save the datasets once transformed...
Have you found a way to save the datasets? I'm also having a lot of difficulty saving.
Hi, yes! When you load a JSONL dataset with read_jsonl, you can iterate over it, and each entry is a document with its annotations. For each document you choose the transform function you want to use. What it returns has an "id" and, more importantly, a "data" field that holds the formatted document, so you just need to write that to a file.
Here is an example (be careful with the encoding, yours might be different, and also beware of the "conll_03" function name, I'm not sure I typed it correctly):
from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open("datasets/test.dataset", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='datasets/doccanoSplit/test_Anotados.jsonl', dataset=NERDataset, encoding='latin-1').conll_03(tokenizer=str.split):
        file.write(entry["data"] + "\n")
I don't want to brag, but I really recommend using my fork of doccano_transformer. I have fixed several important bugs (some annotations weren't being transformed correctly and weren't being saved).
I also added some other transformers. The only catch is that the spaCy converter no longer works and is pending a fix, which I don't have time for right now, so keep that in mind.
To install it, run:
pip install git+https://www.github.com/rjuez00/doccano-transformer
I followed your instructions but I'm getting this error: https://pastebin.com/FaVxBgY6. What could be the problem? All my annotations are okay.
NVM, I fixed it. Apparently I had to change the "label" key in my file to "labels". But your solution works.
I'm running into this issue: KeyError: 'The file should includes either "labels" or "annotations"'. Any suggestions?
Check the keys in your JSONL document and change them throughout the whole document. I guess it was an issue with the annotator, where files are saved with the wrong key. So, for example, if the key is "label", change it to "labels" in every line of the JSONL document.
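For anyone hitting the same KeyError, here is a minimal sketch of the key rename described above, using only the standard library. The function name rename_key_in_jsonl and the file names in the usage note are hypothetical, and it assumes the export has one JSON object per line with the annotations stored under "label".

```python
import json


def rename_key_in_jsonl(in_path, out_path, old_key="label", new_key="labels",
                        encoding="utf-8"):
    """Copy a JSONL file, renaming old_key to new_key in every record.

    Hypothetical helper, not part of doccano-transformer.
    Returns the number of records that were changed.
    """
    changed = 0
    with open(in_path, encoding=encoding) as src, \
            open(out_path, "w", encoding=encoding) as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Only rename when the expected key is not already present.
            if old_key in record and new_key not in record:
                record[new_key] = record.pop(old_key)
                changed += 1
            # ensure_ascii=False keeps accented characters readable.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return changed
```

Something like rename_key_in_jsonl("admin.jsonl", "admin_fixed.jsonl") writes a corrected copy and leaves the original export untouched, so you can compare the two before passing the new file to read_jsonl.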
If someone is looking for a version adapted for CamemBERT (French model), you can find mine here: https://github.com/pdbang/doccano-camembert-transformer
@rjuez00 Hello Rodrigo,
Thank you for your fork. I have many... ERROR NOT ALL TAGS WERE SAVED TO CONLL03...
My tags are correct in doccano, but when I run your script I am missing a lot of BIO labels. It seems that the tokenizer is not taking punctuation such as ",", ":", and "." into account.
Any idea? Thx!
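One possible cause is that str.split only splits on whitespace, so punctuation stays glued to the neighbouring word ("Paris," is a single token) and an entity span that ends before the comma no longer lines up with a token boundary. Below is a minimal sketch of a regex tokenizer that emits punctuation as separate tokens. The name punct_tokenizer is hypothetical, and it is an assumption that the tokenizer argument accepts any callable that takes a string and returns a list of tokens, as the use of str.split suggests.

```python
import re

# A run of word characters, or a single non-space, non-word character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def punct_tokenizer(text):
    """Split text into words and standalone punctuation marks.

    Hypothetical helper, not part of doccano-transformer.
    """
    return _TOKEN_RE.findall(text)


print("Paris, France: ok.".split())
# ['Paris,', 'France:', 'ok.']
print(punct_tokenizer("Paris, France: ok."))
# ['Paris', ',', 'France', ':', 'ok', '.']
```

If the assumption holds, passing tokenizer=punct_tokenizer instead of tokenizer=str.split would be the only change needed. Note that this pattern also splits contractions and decimals ("don't" becomes three tokens, "3.14" becomes three tokens), so it may need adjusting for your data.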
@AkimParis Hello, I am facing the same problem. Did you manage to find a solution?
@ghassenhed I made it work but still with a few errors in the output file. Check the PR -> https://github.com/doccano/doccano-transformer/pull/38/files
And here is my version of rjuez00's script:
from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open("train-final-888.txt", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='admin.jsonl', dataset=NERDataset, encoding='utf-8').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")
Good luck!