
Terrible Documentation

Open rjuez00 opened this issue 2 years ago • 10 comments

Feature description

Improve the documentation. How is it possible that I cannot find documentation explaining the different classes? The tool can be as good as you like, but I have to read the code directly to understand what features it has or how to save the datasets once transformed...

rjuez00 avatar Apr 04 '22 14:04 rjuez00

Have you found a way to save the datasets? I'm also having a lot of difficulty saving.

gilokip avatar Apr 17 '22 00:04 gilokip

Hi, yes! When you load a JSONL dataset with read_jsonl, you can iterate through it; each entry is a document with its annotations. For each document you choose the transform function you want to use. What it returns has an "id" and, importantly, a "data" field containing the formatted document, so you just need to write that to a file.

I leave you an example here (be careful with the encoding, yours might be different; also double-check the "conll_03" function name, I'm not sure I typed it correctly):

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open("datasets/test.dataset", "w", encoding="utf-8") as file:
    # Each entry has an "id" and a "data" field; "data" is the document
    # already formatted as CoNLL 2003, ready to be written out.
    for entry in read_jsonl(filepath='datasets/doccanoSplit/test_Anotados.jsonl', dataset=NERDataset, encoding='latin-1').conll_03(tokenizer=str.split):
        file.write(entry["data"] + "\n")

I don't want to brag, but I really recommend using my fork of doccano_transformer; I have fixed several very important bugs (some annotations weren't being transformed correctly and weren't saved).

Plus I added some other transformers. The only caveat is that the spaCy converter no longer works and is pending a fix, which I don't have time for right now, so keep that in mind.

To install and use it: pip install git+https://www.github.com/rjuez00/doccano-transformer

rjuez00 avatar Apr 17 '22 02:04 rjuez00

I followed your instructions, but I'm getting this error: https://pastebin.com/FaVxBgY6. What could be the problem? All my annotations are okay.

gilokip avatar Apr 18 '22 04:04 gilokip

> I followed your instructions, but I'm getting this error: https://pastebin.com/FaVxBgY6. What could be the problem? All my annotations are okay.

NVM, I fixed it. Apparently I have to change the "label" key in my file to "labels". But your solution works.

gilokip avatar Apr 18 '22 04:04 gilokip

I encountered this issue: KeyError: 'The file should includes either "labels" or "annotations"'. Any suggestions on this?

littlestar502 avatar Aug 15 '22 05:08 littlestar502

In your JSONL document, check your keys and change them throughout the whole document. I guess it was an issue with the annotator, where files are saved with the wrong key. So, for example, if the key is "label", change it to "labels" for the whole JSONL document.
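If many records are affected, the rename can be scripted. A minimal editorial sketch, not from the thread; the function name, file paths, and the exact annotation structure are assumptions:

```python
import json

def rename_label_key(in_path: str, out_path: str) -> None:
    """Rewrite a doccano JSONL export so each record uses 'labels' instead of 'label'."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if "label" in record:  # key written by some doccano exports
                record["labels"] = record.pop("label")
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Run it once on the export before passing the file to read_jsonl.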

gilokip avatar Aug 15 '22 10:08 gilokip

If someone is looking for the version adapted for camembert (french model), here you can find my version : https://github.com/pdbang/doccano-camembert-transformer

pdbang avatar Oct 25 '22 18:10 pdbang

@rjuez00 Hello Rodrigo,

Thank you for your fork. I have many... ERROR NOT ALL TAGS WERE SAVED TO CONLL03...

My tags are correct in Doccano, but when I run your script I am missing a lot of BIO labels. It seems the tokenizer is not taking punctuation such as ",", ":", and "." into account.

Any idea? Thx!

AkimfromParis avatar May 06 '23 09:05 AkimfromParis

@AkimfromParis hello, I am facing the same problem. Did you manage to find a solution?

ghassenhed avatar Jun 06 '23 16:06 ghassenhed

@ghassenhed I made it work, but still with a few errors in the output file. Check the PR -> https://github.com/doccano/doccano-transformer/pull/38/files

And here is my version of rjuez00's script:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open("train-final-888.txt", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='admin.jsonl', dataset=NERDataset, encoding='utf-8').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")

Good luck!

AkimfromParis avatar Jun 07 '23 10:06 AkimfromParis
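One way to approach the punctuation problem discussed above (an editorial sketch, not from the thread) is to pass a tokenizer that emits punctuation as separate tokens instead of str.split. Whether this resolves the missing BIO labels depends on how the library aligns tokens with the character offsets in the annotations, so treat it as something to experiment with:

```python
import re

def punct_tokenizer(text: str) -> list:
    # Keep runs of word characters together, but emit each punctuation
    # mark (",", ":", ".", ...) as its own token, unlike str.split.
    return re.findall(r"\w+|[^\w\s]", text)

# Would be passed as tokenizer=punct_tokenizer in place of tokenizer=str.split.
```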