
How to train and annotate on custom dataset

Open qustions opened this issue 1 year ago • 2 comments

Hello @gwkrsrch, first I want to thank you for open-sourcing this amazing project. My questions may be common or basic, but the answers would help me and others get more clarity. I am trying to train Donut for custom Document Information Extraction, but I don't know which tool to use for annotation. In a comment by @VictorAtPL I saw that they use the Label Studio OCR template to annotate images. This is an example of a Label Studio export:

[
  {
    "ocr": "/data/upload/1/fe00.png",
    "id": 2,
    "bbox": [
      {
        "x": 20.62937062937063,
        "y": 23.60248447204969,
        "width": 18.88111888111888,
        "height": 8.695652173913043,
        "rotation": 0,
        "original_width": 1920,
        "original_height": 1080
      }
    ],
    "transcription": "Definitions",
    "annotator": 1,
    "annotation_id": 2,
    "created_at": "2022-09-06T23:23:49.284150Z",
    "updated_at": "2022-09-06T23:23:49.284176Z",
    "lead_time": 265.562
  }
]
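For readers unfamiliar with this export: Label Studio stores `x`, `y`, `width`, and `height` as percentages of the image size, not pixels. A minimal sketch of converting one box to pixel coordinates follows. The function name `box_to_pixels` is hypothetical, and the sketch assumes `rotation` is 0, as it is in the example above.

```python
def box_to_pixels(box):
    """Convert a Label Studio percentage box to (left, top, right, bottom) pixels.

    Assumes an unrotated rectangle (rotation == 0).
    """
    img_w = box["original_width"]
    img_h = box["original_height"]
    left = box["x"] / 100.0 * img_w
    top = box["y"] / 100.0 * img_h
    right = left + box["width"] / 100.0 * img_w
    bottom = top + box["height"] / 100.0 * img_h
    return tuple(round(v) for v in (left, top, right, bottom))


example_box = {
    "x": 20.62937062937063,
    "y": 23.60248447204969,
    "width": 18.88111888111888,
    "height": 8.695652173913043,
    "rotation": 0,
    "original_width": 1920,
    "original_height": 1080,
}
print(box_to_pixels(example_box))  # (396, 255, 759, 349)
```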

My questions are:

  1. Which is the best tool for annotating data for Donut custom Document Information Extraction?
  2. Should we annotate each text box and also write out its text, as in the example? If yes, what is an efficient way to do it?
  3. Is there a converter script that converts the Label Studio format to the Donut format?
  4. Is there any document that covers start-to-end training on custom data, including annotation?
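
As a rough illustration of question 3, here is a minimal sketch of what such a converter might look like. It is not an official script. It assumes the Donut dataset layout of one JSON line per image with `file_name` and a JSON-encoded `ground_truth` containing `gt_parse`, and it puts the transcription under a `text_sequence` key, which suits a plain text-reading task; an information extraction task would need its own field names instead. The function name `label_studio_to_donut` is hypothetical, and bounding boxes are dropped because this layout only carries text.

```python
import json
from pathlib import PurePosixPath


def label_studio_to_donut(tasks):
    """Turn Label Studio OCR export records into Donut-style metadata lines.

    Assumption: "transcription" is either a single string (one region)
    or a list of strings (several regions).
    """
    lines = []
    for task in tasks:
        transcription = task.get("transcription", "")
        if isinstance(transcription, str):
            texts = [transcription]
        else:
            texts = list(transcription)
        ground_truth = {"gt_parse": {"text_sequence": " ".join(texts)}}
        record = {
            # Keep only the image file name, not the Label Studio upload path.
            "file_name": PurePosixPath(task["ocr"]).name,
            "ground_truth": json.dumps(ground_truth),
        }
        lines.append(json.dumps(record))
    return lines


exported = [{"ocr": "/data/upload/1/fe00.png", "transcription": "Definitions"}]
for line in label_studio_to_donut(exported):
    print(line)
```

Each returned string would be written as one line of a `metadata.jsonl` file placed next to the images.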

qustions avatar Sep 07 '22 00:09 qustions

@VictorAtPL It would be great if you could share your Label Studio template.

qustions avatar Sep 07 '22 19:09 qustions

Try the Sparrow UI tool created by katanaml. There is a YouTube video with instructions on how to use it.

https://github.com/katanaml/sparrow

It should save you time too.

sudhitpanchal avatar Oct 16 '23 06:10 sudhitpanchal