How to create a dataset in FUNSD format?

Open DimIsaev opened this issue 2 years ago • 9 comments

example format:

    {
        "form": [
            {
                "id": 0,
                "text": "Registration No.",
                "box": [94,169,191,186],
                "linking": [
                    [0,1]
                ],
                "label": "question",
                "words": [
                    {
                        "text": "Registration",
                        "box": [94,169,168,186]
                    },
                    {
                        "text": "No.",
                        "box": [170,169,191,183]
                    }
                ]
            },
            {
                "id": 1,
                "text": "533",
                "box": [209,169,236,182],
                "label": "answer",
                "words": [
                    {
                        "box": [209,169,236,182],
                        "text": "533"
                    }
                ],
                "linking": [
                    [0,1]
                ]
            }
        ]
    }
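
To make the structure of the example above easier to reason about, here is a minimal Python sketch that checks a FUNSD-style annotation for internal consistency. The function name `check_funsd` is hypothetical, and the rules it enforces (unique ids, every link is a pair of known ids, every link includes the id of the entity that lists it) are assumptions read off this one example, not an official FUNSD validator.

```python
# A trimmed copy of the example annotation from this issue.
example = {
    "form": [
        {"id": 0, "text": "Registration No.", "box": [94, 169, 191, 186],
         "linking": [[0, 1]], "label": "question", "words": []},
        {"id": 1, "text": "533", "box": [209, 169, 236, 182],
         "linking": [[0, 1]], "label": "answer", "words": []},
    ]
}


def check_funsd(annotation):
    """Return a list of problems found in a FUNSD-style dict (empty if none)."""
    problems = []
    entities = annotation.get("form", [])
    ids = [entity["id"] for entity in entities]
    if len(ids) != len(set(ids)):
        problems.append("duplicate entity ids")
    known = set(ids)
    for entity in entities:
        for pair in entity.get("linking", []):
            # Assumption: a link is a pair [id_a, id_b] of existing entity ids.
            if len(pair) != 2 or not set(pair) <= known:
                problems.append(f"entity {entity['id']}: bad link {pair}")
            # Assumption: an entity only lists links it takes part in.
            elif entity["id"] not in pair:
                problems.append(f"entity {entity['id']}: foreign link {pair}")
    return problems


print(check_funsd(example))
```

Running a check like this on converted data can surface dangling links before training.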

DimIsaev avatar Jul 08 '22 18:07 DimIsaev

You need to construct a labeling config first, then it will be much easier to understand what you need.

makseq avatar Jul 11 '22 17:07 makseq

You need to construct a labeling config first, then it will be much easier to understand what you need.

What does it mean to construct a labeling config?

Here is an example dataset, https://guillaumejaume.github.io/FUNSD/description/, which is used to train this model:
https://huggingface.co/microsoft/layoutlmv3-base

P.S. The VGG annotator (VIA), https://www.robots.ox.ac.uk/~vgg/software/via/, can be used to solve this problem.

DimIsaev avatar Jul 11 '22 17:07 DimIsaev

It looks like a standard OCR task, check this labeling config please, will it work for you?

<View>
  <Image name="image" value="$ocr"/>

  <Labels name="label" toName="image">
    <Label value="Text" background="green"/>
    <Label value="Handwriting" background="blue"/>
  </Labels>

  <Rectangle name="bbox" toName="image" strokeWidth="3"/>
  <Polygon name="poly" toName="image" strokeWidth="3"/>

  <TextArea name="transcription" toName="image"
            editable="true"
            perRegion="true"
            required="true"
            maxSubmissions="1"
            rows="5"
            placeholder="Recognized Text"
            displayMode="region-list"
            />
</View>
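
For context on how rectangles drawn with a config like this relate to FUNSD boxes: Label Studio exports rectangle coordinates as percentages of the image size, while FUNSD uses pixel corners `[x0, y0, x1, y1]`. A minimal sketch of that conversion follows. The function name `ls_rect_to_funsd_box` is hypothetical, and the sketch assumes an unrotated rectangle whose `value` carries `x`, `y`, `width` and `height`.

```python
def ls_rect_to_funsd_box(value, original_width, original_height):
    """Convert a percent-based rectangle into a pixel box [x0, y0, x1, y1].

    Assumption: the rectangle is not rotated; any rotation is ignored.
    """
    x0 = value["x"] * original_width / 100.0
    y0 = value["y"] * original_height / 100.0
    x1 = (value["x"] + value["width"]) * original_width / 100.0
    y1 = (value["y"] + value["height"]) * original_height / 100.0
    return [round(x0), round(y0), round(x1), round(y1)]


# A region covering 30% x 40% of a 1000 x 500 image, offset by (10%, 20%).
region = {"x": 10, "y": 20, "width": 30, "height": 40}
print(ls_rect_to_funsd_box(region, 1000, 500))  # [100, 100, 400, 300]
```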

makseq avatar Jul 11 '22 18:07 makseq

It looks like a standard OCR task, check this labeling config please, will it work for you?

It doesn't seem to solve the linking issue.
Here is that part of the dataset:

            "linking": [
                [0,1]
            ]

DimIsaev avatar Jul 11 '22 18:07 DimIsaev

Is this a link between bboxes?

makseq avatar Jul 11 '22 21:07 makseq

Is this a link between bboxes?

Yes, as in this example, which can be downloaded.

Model card: https://huggingface.co/spaces/permutans/LayoutLMv3-FUNSD

DimIsaev avatar Jul 12 '22 08:07 DimIsaev

LS provides links between any objects; try using relations.
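
As a rough illustration of how relations could be turned into FUNSD `linking` pairs, here is a minimal Python sketch. It assumes the exported annotation `result` list contains region items with an `id` and relation items with `"type": "relation"`, `from_id` and `to_id`. The function name `relations_to_linking` and the choice to number entities in order of first appearance are assumptions for illustration, not part of any converter.

```python
def relations_to_linking(results):
    """Map Label Studio relation items to FUNSD-style linking pairs.

    Returns {entity_index: [[from_index, to_index], ...]}.
    """
    # Several result items (rectangle, labels, textarea) may share one
    # region id, so number each region once, in order of first appearance.
    index = {}
    for item in results:
        if item.get("type") == "relation":
            continue
        region_id = item.get("id")
        if region_id is not None and region_id not in index:
            index[region_id] = len(index)

    linking = {i: [] for i in index.values()}
    for item in results:
        if item.get("type") != "relation":
            continue
        src = index.get(item.get("from_id"))
        dst = index.get(item.get("to_id"))
        if src is None or dst is None:
            continue  # relation points at a region we do not know
        pair = [src, dst]
        # FUNSD lists the same pair on both linked entities.
        for i in {src, dst}:
            if pair not in linking[i]:
                linking[i].append(pair)
    return linking


results = [
    {"id": "a", "type": "rectangle"},
    {"id": "a", "type": "textarea"},
    {"id": "b", "type": "rectangle"},
    {"type": "relation", "from_id": "a", "to_id": "b", "direction": "right"},
]
print(relations_to_linking(results))  # {0: [[0, 1]], 1: [[0, 1]]}
```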

makseq avatar Jul 13 '22 22:07 makseq

@DimIsaev did you solve this in the end? Or you used another tool?

WaterKnight1998 avatar Aug 04 '22 11:08 WaterKnight1998

@DimIsaev did you solve this in the end? Or you used another tool?

The project is paused; I'll be back later.

DimIsaev avatar Aug 04 '22 11:08 DimIsaev

I am also having to label something like this.

The labeling tool should use auto OCR so that it becomes easy to work for these cases.

MercyPrasanna avatar Sep 09 '22 09:09 MercyPrasanna

I am also having to label something like this.

The labeling tool should use auto OCR so that it becomes easy to work for these cases.

They added a new feature for this https://github.com/heartexlabs/label-studio-ml-backend/tree/master/label_studio_ml/examples/tesseract

WaterKnight1998 avatar Sep 09 '22 10:09 WaterKnight1998

@makseq can we connect on Slack? I have an urgent query related to the same issue: converting OCR output into the FUNSD dataset format.

knowrohit avatar Sep 16 '22 07:09 knowrohit

Check this PR: https://github.com/heartexlabs/label-studio-converter/pull/127. Now you can export to FUNSD using this script.

Unfortunately, we can't make a 100% compatible conversion to FUNSD format, because it has root bboxes and word bboxes, and LS doesn't build root bboxes automatically. So this converter creates one root bbox with one word inside it.
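
For anyone who needs real root bboxes, one possible post-processing step is to group the exported words yourself and take the union of their boxes. The sketch below is not what the converter in the PR does. The helper names `merge_word_boxes` and `build_entity` are hypothetical, and how words get grouped into an entity is assumed to be decided elsewhere.

```python
def merge_word_boxes(words):
    """Return the smallest box [x0, y0, x1, y1] that encloses every word box."""
    boxes = [word["box"] for word in words]
    if not boxes:
        raise ValueError("cannot build a root bbox from zero words")
    return [
        min(box[0] for box in boxes),
        min(box[1] for box in boxes),
        max(box[2] for box in boxes),
        max(box[3] for box in boxes),
    ]


def build_entity(entity_id, words, label):
    """Assemble one FUNSD-style entity from an already grouped list of words."""
    return {
        "id": entity_id,
        "text": " ".join(word["text"] for word in words),
        "box": merge_word_boxes(words),
        "linking": [],
        "label": label,
        "words": words,
    }


# The two words of the "Registration No." entity from the example in this issue.
words = [
    {"text": "Registration", "box": [94, 169, 168, 186]},
    {"text": "No.", "box": [170, 169, 191, 183]},
]
entity = build_entity(0, words, "question")
print(entity["text"], entity["box"])  # Registration No. [94, 169, 191, 186]
```

The merged box matches the root box of entity 0 in the example at the top of this issue.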

makseq avatar Sep 19 '22 22:09 makseq

I am also having to label something like this. The labeling tool should use auto OCR so that it becomes easy to work for these cases.

They added a new feature for this https://github.com/heartexlabs/label-studio-ml-backend/tree/master/label_studio_ml/examples/tesseract

Hi @WaterKnight1998, I tried following the README. Step 4, setting up the Tesseract ML backend, says:

pip install -r label_studio_ml/examples/tesseract/requirements.txt
label-studio-ml init my-ml-backend --from label_studio_ml/examples/tesseract/ner_ml_backend.py --force
label-studio-ml start my-ml-backend -d -p=9090 --debug

But I cannot find the file label_studio_ml/examples/tesseract/ner_ml_backend.py.

wujushan avatar Sep 27 '22 03:09 wujushan

The ner_ml_backend.py file is now tesseract.py, boss.

knowrohit avatar Sep 27 '22 03:09 knowrohit

Thanks @knowrohit

wujushan avatar Sep 27 '22 06:09 wujushan

Check this PR: HumanSignal/label-studio-converter#127. Now you can export to FUNSD using this script.

Unfortunately, we can't make a 100% compatible conversion to FUNSD format, because it has root bboxes and word bboxes, and LS doesn't build root bboxes automatically. So this converter creates one root bbox with one word inside it.

This script doesn't support linking[]. How can I do that?

MonaMamdouh66 avatar Jan 23 '24 09:01 MonaMamdouh66