
Train LayoutLMv3 with a custom dataset by loading from a local directory

deepanshudashora opened this issue 2 years ago • 10 comments

How can I train LayoutLMv3 with a custom dataset by loading it from a local directory?

deepanshudashora avatar Jun 10 '22 04:06 deepanshudashora

@deepanshudashora

Were you able to create your own dataset? You can try using Label Studio OCR.

jyotiyadav94 avatar Jun 12 '22 17:06 jyotiyadav94

Hi,

You can create a regular PyTorch Dataset as follows:

from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, root, df, processor):
        self.root = root
        self.df = df
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get document image + corresponding words and boxes
        item = self.df.iloc[idx]
        image = Image.open(self.root + ...).convert('RGB')
        words = item.words
        boxes = item.boxes

        # use processor to prepare everything for the model
        encoding = self.processor(image, words, boxes=boxes)

        return encoding

This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.

You can then instantiate the dataset as follows:

from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")

dataset = CustomDataset(root="path_to_your_root", df=your_dataframe, processor=processor)
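
To iterate over it for training, you could then wrap it in a PyTorch DataLoader. A minimal sketch; batching with the default collate function assumes __getitem__ returns fixed-length tensors (e.g. the processor called with truncation=True, padding="max_length", return_tensors="pt", and the batch dimension squeezed out):

from torch.utils.data import DataLoader

# each batch stacks the per-example tensors, provided every example
# was padded/truncated to the same sequence length by the processor
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
batch = next(iter(dataloader))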

NielsRogge avatar Jun 15 '22 09:06 NielsRogge

Hi @NielsRogge, is it possible to get in touch with you somehow? I could not find your email address, and your Twitter messages are blocked. If you won a billion dollars, there would be no way to tell you about it :D

I am Ivan from www.Photopea.com, which is probably the best free photo editor that exists today :) Would you be interested in some kind of cooperation? We have millions of users. We would like to add AI features, which could run on the Hugging Face infrastructure. You can write to me on Twitter: https://twitter.com/photopeacom or [email protected]

photopea avatar Jun 17 '22 10:06 photopea

Hi, I created an annotations folder containing JSON files like this:

{
    "form": [
        {
            "box": [
                84,
                109,
                136,
                119
            ],
            "text": "23456789",
            "label": "invoice_num",
            "words": [
                {
                    "box": [
                        84,
                        109,
                        136,
                        119
                    ],
                    "text": "23456789"
                }
            ]
.
.
.

Please guide me: how can I train LayoutLMv3 on this?

aditya11ad avatar Jul 13 '22 11:07 aditya11ad

Hi @aditya11ad

you need to follow this script so that LayoutLMv3 accepts the input: https://huggingface.co/datasets/nielsr/funsd-layoutlmv3/blob/main/funsd-layoutlmv3.py
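
For instance, that script can be loaded directly with the datasets library and inspected to see the expected input format (newer library versions may additionally require trust_remote_code=True for dataset scripts):

from datasets import load_dataset

# downloads FUNSD and runs the processing defined in the script
dataset = load_dataset("nielsr/funsd-layoutlmv3")
print(dataset["train"].features)  # inspect the expected fields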

jyotiyadav94 avatar Jul 13 '22 12:07 jyotiyadav94

Thanks for the quick response, but I didn't get how this script takes the inputs.

aditya11ad avatar Jul 13 '22 12:07 aditya11ad

Actually, this script takes bounding box inputs as (left, top, right, bottom). These should be normalized: x by the width and y by the height. As for the tokens, if a token is the start of a word in the sentence, its label should be prefixed with B- (beginning); otherwise with I- (intermediate). The actual data gets downloaded from the site; then refer to the annotation folder for further information. To get annotations, use any OCR option like Google Tesseract or Azure Form Recognizer.
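
As a minimal sketch, the normalization mentioned above scales pixel coordinates to the 0-1000 range that LayoutLM-style models expect (the helper name is illustrative):

def normalize_box(box, width, height):
    # box is (x0, y0, x1, y1) in pixels; the model expects values in 0-1000
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]

Labels would then be tagged as, e.g., "B-invoice_num" for the first word of an entity and "I-invoice_num" for the following words.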

techthiyanes avatar Jul 13 '22 13:07 techthiyanes

Hi,

You don't necessarily have to write a script like the one for FUNSD. You can just create a custom PyTorch dataset, which I explain here: https://github.com/NielsRogge/Transformers-Tutorials/issues/123#issuecomment-1156256273

NielsRogge avatar Jul 19 '22 09:07 NielsRogge

Hi @photopea, thanks for reaching out. I've forwarded your request to the team, someone will reach out :)

NielsRogge avatar Jul 19 '22 09:07 NielsRogge

Hi, I have prepared the dataframe like this:

[image: screenshot of the prepared dataframe]

Now, what should the script for fine-tuning be?

aditya11ad avatar Aug 10 '22 07:08 aditya11ad

Hi @aditya11ad
This might be helpful https://github.com/ruifcruz/sroie-on-layoutlm/blob/main/LayoutLM_fine_tunning_for_SROIE_dataset.ipynb

pavel-nesterov avatar Oct 20 '22 05:10 pavel-nesterov

Hi, can we train LayoutLM on a custom dataset with .txt annotation files (YOLO-format annotation files) available on a local machine?

PoonamS25 avatar Mar 14 '23 13:03 PoonamS25

Hi @PoonamS25, you'll probably need to convert them to the format that LayoutLM expects. Basically for each document you need a list of words, with corresponding bounding box coordinates and labels. Each bounding box needs to be in the format (x0, y0, x1, y1), where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
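
For example, a minimal sketch of converting a YOLO-format box (class, x_center, y_center, width, height, with coordinates normalized to [0, 1]) into such corner coordinates:

def yolo_to_corners(x_center, y_center, w, h, img_width, img_height):
    # convert normalized YOLO center/size format to pixel (x0, y0, x1, y1)
    x0 = (x_center - w / 2) * img_width
    y0 = (y_center - h / 2) * img_height
    x1 = (x_center + w / 2) * img_width
    y1 = (y_center + h / 2) * img_height
    return [int(x0), int(y0), int(x1), int(y1)]

Note that YOLO annotations only give you boxes and class labels, not the words themselves, so you would still need OCR to obtain the text.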

NielsRogge avatar Apr 03 '23 11:04 NielsRogge

> (quoting NielsRogge's CustomDataset example from above)

@NielsRogge What about the labels? Using OCR we can get words and bounding boxes, but you haven't mentioned anything about labels. I believe we also need to generate labels somehow and use them. Can you clarify whether we need those or not?

DANISHFAYAZNAJAR avatar Jun 28 '23 06:06 DANISHFAYAZNAJAR

@NielsRogge If I do LayoutLM training with custom invoice images, what should the annotation format be? Should I use a Q-A format like FUNSD (invoice_number_Q & invoice_number_A, date_Q & date_A, etc.), or can I just annotate all labels directly, like invoice_number, invoice_date, vendor_name, etc.?

vcjayan avatar Jul 01 '23 07:07 vcjayan

@DANISHFAYAZNAJAR if you have labels at the word level (like the FUNSD dataset has), then you can do the following:

# get document image + corresponding words, boxes and labels at the word level
item = self.df.iloc[idx]
image = Image.open(self.root + ...).convert('RGB')
words = item.words
boxes = item.boxes
word_labels = item.ner_tags

# use processor to prepare everything for the model
encoding = self.processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

# remove batch dimension which the processor adds by default
encoding = {k:v.squeeze() for k,v in encoding.items()}

return encoding

NielsRogge avatar Jul 28 '23 11:07 NielsRogge

@vcjayan you just need a list of words, their boxes and their labels for each document.

So this could look like:

words = ["hello", "world", "this", "is", "invoice", "number", '14721"]
boxes = [[1,2,3,4] for _ in range(len(words))]
word_labels = ["other", "other", "other", "other", "other", "other", "invoice_number"]

assuming you have 2 classes ("other" and "invoice_number")
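
Note that the processor's word_labels argument expects integer class ids rather than strings, so in practice you would map them first. A sketch, with an illustrative label list:

label_list = ["other", "invoice_number"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

# convert the string labels above to the integer ids the processor expects
word_labels = [label2id[label] for label in word_labels]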

NielsRogge avatar Jul 28 '23 11:07 NielsRogge

@NielsRogge How would we specify the train and test splits? I am using this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb. I tried the code provided above, but the dataset is returned as a <main.CustomDataset at 0x7a1c979eead0> object, whereas the expectation was a DatasetDict object. What am I missing? Please help.

ankitarajsharma avatar Jul 31 '23 07:07 ankitarajsharma

You can create 2 instances, like so:

train_dataset = CustomDataset(dataset=dataset["train"], processor=processor)
val_dataset = CustomDataset(dataset=dataset["validation"], processor=processor)
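
If your examples instead live in a single Pandas dataframe (as in the CustomDataset sketch above), you could split it first. A minimal sketch, with an illustrative 80/20 split:

# split the dataframe into train and validation portions
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index).reset_index(drop=True)
train_df = train_df.reset_index(drop=True)

train_dataset = CustomDataset(root="path_to_your_root", df=train_df, processor=processor)
val_dataset = CustomDataset(root="path_to_your_root", df=val_df, processor=processor)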

NielsRogge avatar Jul 31 '23 07:07 NielsRogge

> (quoting NielsRogge's CustomDataset example from above)

@NielsRogge For this CustomDataset, should the item returned by the __getitem__ method be only the encoding, or can we separately return the following?

return {
    "input_ids": torch.tensor(self.encoding["input_ids"][index], dtype=torch.int64),
    "attention_mask": torch.tensor(self.encoding["attention_mask"][index], dtype=torch.int64),
    "bbox": torch.tensor(self.encoding["bbox"], dtype=torch.int64),
    "pixel_values": torch.tensor(self.encoding["pixel_values"], dtype=torch.float32),
    "labels": torch.tensor(self.encoding["labels"], dtype=torch.int64),
}
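
For reference, a minimal sketch of a __getitem__ that returns a plain dict of tensors, relying on the processor to build them (with return_tensors="pt" the processor already yields tensors, so wrapping them in torch.tensor again is unnecessary):

def __getitem__(self, idx):
    item = self.df.iloc[idx]
    image = Image.open(self.root + ...).convert('RGB')

    encoding = self.processor(
        image, item.words, boxes=item.boxes, word_labels=item.ner_tags,
        truncation=True, padding="max_length", return_tensors="pt",
    )
    # squeeze out the batch dimension the processor adds,
    # so the DataLoader can stack examples itself
    return {k: v.squeeze(0) for k, v in encoding.items()}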

madhavi1102 avatar Aug 08 '23 12:08 madhavi1102

> (quoting aditya11ad's JSON annotation example from above)

Please tell me, how did you generate the annotations?

Dhananjay-97 avatar Aug 30 '23 12:08 Dhananjay-97

Hi, I trained on a dataset of annotated form documents with 200 images. The model trained for 100 epochs, but I am not getting any results during inference.

Aesthethic0de avatar Oct 12 '23 17:10 Aesthethic0de

@NielsRogge Can you help me create a mapping between the predicted labels and the associated tokens (words)? I tried extracting text from the bounding boxes, which was not accurate. Can we create a direct mapping from the labels to the words? Many thanks in advance.
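
For what it's worth, a minimal sketch of mapping token-level predictions back to words via the tokenizer's word_ids(). This assumes a fine-tuned LayoutLMv3ForTokenClassification as model, plus image, words, and boxes already available; a fast tokenizer is required for word_ids():

import torch

encoding = processor(image, words, boxes=boxes, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()

# keep only the prediction for the first subtoken of each word
word_to_label = {}
for token_idx, word_id in enumerate(encoding.word_ids(0)):
    if word_id is not None and word_id not in word_to_label:
        word_to_label[word_id] = model.config.id2label[predictions[token_idx]]

for word_id, label in word_to_label.items():
    print(words[word_id], "->", label)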

vidya-chandran avatar Jan 22 '24 11:01 vidya-chandran

> (quoting aditya11ad's JSON annotation example from above)

Hey, can you guide me on how you prepared the dataset in this format?

yashakagf avatar Mar 28 '24 05:03 yashakagf