label-studio
How do I create a dataset in FUNSD format?
Example format:
{
  "form": [
    {
      "id": 0,
      "text": "Registration No.",
      "box": [94,169,191,186],
      "linking": [
        [0,1]
      ],
      "label": "question",
      "words": [
        {
          "text": "Registration",
          "box": [94,169,168,186]
        },
        {
          "text": "No.",
          "box": [170,169,191,183]
        }
      ]
    },
    {
      "id": 1,
      "text": "533",
      "box": [209,169,236,182],
      "label": "answer",
      "words": [
        {
          "box": [209,169,236,182],
          "text": "533"
        }
      ],
      "linking": [
        [0,1]
      ]
    }
  ]
}
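For anyone producing files in this shape by hand or by script, a small consistency check on the "linking" field may help. This is only a sketch: it assumes, based on the example above, that every linking pair has two ids, includes the id of the entity it is stored on, and refers only to ids present in "form". The function name check_linking and the trimmed sample are hypothetical.

```python
import json


def check_linking(doc):
    """Return a list of problems found in the "linking" fields of a FUNSD-style dict."""
    ids = {entity["id"] for entity in doc["form"]}
    problems = []
    for entity in doc["form"]:
        for pair in entity.get("linking", []):
            if len(pair) != 2:
                problems.append(f"entity {entity['id']}: malformed pair {pair}")
                continue
            # Assumption from the example: a pair is stored on both linked entities.
            if entity["id"] not in pair:
                problems.append(f"entity {entity['id']}: pair {pair} does not include it")
            for ref in pair:
                if ref not in ids:
                    problems.append(f"entity {entity['id']}: unknown id {ref} in {pair}")
    return problems


# Trimmed copy of the example above ("words" left empty for brevity).
sample = json.loads("""
{"form": [
  {"id": 0, "text": "Registration No.", "box": [94, 169, 191, 186],
   "linking": [[0, 1]], "label": "question", "words": []},
  {"id": 1, "text": "533", "box": [209, 169, 236, 182],
   "linking": [[0, 1]], "label": "answer", "words": []}
]}
""")
print(check_linking(sample))
```

An empty list means no inconsistencies were found under the assumptions stated above.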
You need to construct a labeling config first, then it will be much easier to understand what you need.
What does "construct a labeling config" mean?
Example dataset: https://guillaumejaume.github.io/FUNSD/description/
For training this model:
https://huggingface.co/microsoft/layoutlmv3-base
PS: the VGG annotator (VIA), https://www.robots.ox.ac.uk/~vgg/software/via/, is one option for solving this problem.
It looks like a standard OCR task, check this labeling config please, will it work for you?
<View>
  <Image name="image" value="$ocr"/>
  <Labels name="label" toName="image">
    <Label value="Text" background="green"/>
    <Label value="Handwriting" background="blue"/>
  </Labels>
  <Rectangle name="bbox" toName="image" strokeWidth="3"/>
  <Polygon name="poly" toName="image" strokeWidth="3"/>
  <TextArea name="transcription" toName="image"
            editable="true"
            perRegion="true"
            required="true"
            maxSubmissions="1"
            rows="5"
            placeholder="Recognized Text"
            displayMode="region-list"
  />
</View>
It doesn't seem to solve the link issue.
Here is that part of the dataset:
"linking": [
[0,1]
]
Is this a link between bboxes?
yes
See this example
model card: https://huggingface.co/spaces/permutans/LayoutLMv3-FUNSD
LS provides links between any objects; try using relations.
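As a rough sketch of how relations could be turned into FUNSD "linking" pairs after export: this assumes the Label Studio JSON export stores each region in the annotation's result list with a string "id", and each relation as an item with "type": "relation", "from_id" and "to_id". Please check that against your own export before relying on it. The function name relations_to_linking and the sample data are hypothetical.

```python
from collections import defaultdict


def relations_to_linking(result):
    """Map Label Studio relation items to FUNSD-style linking pairs.

    Returns (id_map, linking): id_map maps region ids to integer FUNSD ids,
    linking maps each FUNSD id to the [from, to] pairs it takes part in.
    """
    id_map = {}
    for item in result:
        if item.get("type") == "relation":
            continue
        # Several result items (labels, transcription) can share one region id.
        region_id = item["id"]
        if region_id not in id_map:
            id_map[region_id] = len(id_map)

    linking = defaultdict(list)
    for item in result:
        if item.get("type") != "relation":
            continue
        src = id_map.get(item["from_id"])
        dst = id_map.get(item["to_id"])
        if src is None or dst is None:
            continue  # relation points at a region that is not in this result
        pair = [src, dst]
        # The FUNSD example repeats the same pair on both linked entities.
        linking[src].append(pair)
        linking[dst].append(pair)
    return id_map, dict(linking)


# Made-up result list shaped like the assumed export format.
sample_result = [
    {"id": "q1", "type": "rectanglelabels", "value": {"rectanglelabels": ["question"]}},
    {"id": "q1", "type": "textarea", "value": {"text": ["Registration No."]}},
    {"id": "a1", "type": "rectanglelabels", "value": {"rectanglelabels": ["answer"]}},
    {"type": "relation", "from_id": "q1", "to_id": "a1", "direction": "right"},
]
id_map, linking = relations_to_linking(sample_result)
print(id_map)
print(linking)
```

For the sample, both the question and the answer end up with the pair [0, 1], which matches the "linking" entries in the FUNSD example from the original question.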
@DimIsaev did you solve this in the end? Or did you use another tool?
The project is paused; I'll be back later.
I also have to label something like this.
The labeling tool should run OCR automatically so that these cases are easier to work with.
They added a new feature for this https://github.com/heartexlabs/label-studio-ml-backend/tree/master/label_studio_ml/examples/tesseract
@makseq can we connect on Slack? I have an urgent question related to the same issue: the output format for exporting OCR annotations to a FUNSD dataset.
Check this PR: https://github.com/heartexlabs/label-studio-converter/pull/127. Now you can export to FUNSD using this script.
Unfortunately, we can't make the conversion 100% compatible with the FUNSD format, because FUNSD has root bboxes and word bboxes, and LS doesn't build root bboxes automatically. So this converter creates one root bbox with one word inside of it.
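To illustrate what "one root bbox with one word inside of it" looks like, here is a minimal sketch. It is not the converter's actual code. It assumes Label Studio rectangle values store x, y, width and height as percentages of the image size, and that FUNSD boxes are [left, top, right, bottom] in pixels, as in the example from the question. The function name region_to_funsd_entity and the numbers are hypothetical.

```python
def region_to_funsd_entity(entity_id, value, original_width, original_height, text, label):
    """Build one FUNSD entity whose single word covers the whole root box."""
    # Assumption: x, y, width, height are percentages (0-100) of the image size.
    x0 = value["x"] / 100.0 * original_width
    y0 = value["y"] / 100.0 * original_height
    x1 = (value["x"] + value["width"]) / 100.0 * original_width
    y1 = (value["y"] + value["height"]) / 100.0 * original_height
    box = [round(x0), round(y0), round(x1), round(y1)]
    return {
        "id": entity_id,
        "text": text,
        "box": box,
        "linking": [],  # left empty here; links have to come from relations
        "label": label,
        # One word that repeats the root text and box.
        "words": [{"text": text, "box": list(box)}],
    }


entity = region_to_funsd_entity(
    0,
    {"x": 9.4, "y": 16.9, "width": 9.7, "height": 1.7},
    original_width=1000,
    original_height=1000,
    text="Registration No.",
    label="question",
)
print(entity["box"])
```

Compared with the original FUNSD example, the root entity here has a single word ("Registration No.") instead of two separate words, which is the compatibility gap described above.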
Hi, @WaterKnight1998, I tried following the README. In step 4, set up the Tesseract ML backend:
pip install -r label_studio_ml/examples/tesseract/requirements.txt
label-studio-ml init my-ml-backend --from label_studio_ml/examples/tesseract/ner_ml_backend.py --force
label-studio-ml start my-ml-backend -d -p=9090 --debug
But I cannot find the file label_studio_ml/examples/tesseract/ner_ml_backend.py.
The ner_ml_backend.py file is now tesseract.py, boss.
Thanks @knowrohit
This script doesn't support linking[]. How can I do that?