Transformers-Tutorials
How to create a PyTorch Dataset for LayoutLMv2 from my custom images and JSON files?
@NielsRogge thanks for this LayoutLMv2 implementation in HF. I want to create a torch Dataset from my custom images and JSON files (for now, assume the data is the downloaded FUNSD dataset). Please guide me on how to create this torch Dataset so that I can give the data as input to LayoutLMv2Processor and apply the map function.
This is what I tried:
```python
import json
import os

import torch
from torch.utils.data import Dataset
from detectron2.data.detection_utils import read_image
from detectron2.data.transforms import ResizeTransform, TransformList


def normalize_bbox(box, size):
    width, height = size[0], size[1]
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]


def load_image(image_path):
    image = read_image(image_path, format="BGR")
    h = image.shape[0]
    w = image.shape[1]
    img_trans = TransformList([ResizeTransform(h=h, w=w, new_h=224, new_w=224)])
    image = torch.tensor(img_trans.apply_image(image).copy()).permute(2, 0, 1)  # copy to make it writeable
    return image, (w, h)


label2id = {
    "B-ANSWER": 5,
    "B-HEADER": 1,
    "B-QUESTION": 3,
    "I-ANSWER": 6,
    "I-HEADER": 2,
    "I-QUESTION": 4,
    "O": 0,
    "B-O": 7,
}


class CustomTextDataset(Dataset):
    def __init__(self, filepath):
        self.filepath = filepath

    def __len__(self):
        return len(os.path.join(self.filepath, "annotations"))

    def __getitem__(self, idx):
        # logger.info("⏳ Generating examples from = %s", filepath)
        ann_dir = os.path.join(self.filepath, "annotations")
        img_dir = os.path.join(self.filepath, "images")
        all_data = []
        for guid, file in enumerate(sorted(os.listdir(ann_dir))):
            tokens = []
            bboxes = []
            ner_tags = []
            file_path = os.path.join(ann_dir, file)
            with open(file_path, "r", encoding="utf8") as f:
                data = json.load(f)
            image_path = os.path.join(img_dir, file)
            image_path = image_path.replace("json", "png")
            image, size = load_image(image_path)
            # print("here is the size variable", size)
            for item in data["form"]:
                words, label = item["words"], item["label"]
                words = [w for w in words if w["text"].strip() != ""]
                if len(words) == 0:
                    continue
                if label == "other":
                    for w in words:
                        tokens.append(w["text"])
                        ner_tags.append(label2id["O"])
                        bboxes.append(normalize_bbox(w["box"], size))
                else:
                    tokens.append(words[0]["text"])
                    ner_tags.append(label2id["B-" + label.upper()])
                    bboxes.append(normalize_bbox(words[0]["box"], size))
                    for w in words[1:]:
                        tokens.append(w["text"])
                        ner_tags.append(label2id["I-" + label.upper()])
                        bboxes.append(normalize_bbox(w["box"], size))
            all_data.append({"id": str(guid), "tokens": tokens, "bboxes": bboxes, "ner_tags": ner_tags, "image_path": image_path})
        sample = all_data[idx]
        return sample
```
By creating the data this way, I got the following error while training: `TypeError: LayoutLMv2ForTokenClassification object argument after ** must be a mapping, not list`.
Please help me out. Thanks in advance for the solution.
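For context, my training loop calls the model as `model(**batch)`, so I guess each batch has to be a mapping of tensors rather than a list of fields. Is something like the following what is expected? This is only a rough sketch: the `EncodedFUNSD` class name and the `examples` argument are my own placeholders (with `examples` being the list of dicts built by the code above), and the processor is set up with `apply_ocr=False` since the words and boxes already come from the JSON.

```python
from PIL import Image
from torch.utils.data import Dataset
from transformers import (LayoutLMv2FeatureExtractor, LayoutLMv2Processor,
                          LayoutLMv2TokenizerFast)

# apply_ocr=False because words and boxes already come from the FUNSD annotations
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)


class EncodedFUNSD(Dataset):
    def __init__(self, examples, processor):
        # `examples`: list of dicts with "tokens", "bboxes", "ner_tags", "image_path"
        self.examples = examples
        self.processor = processor

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        image = Image.open(example["image_path"]).convert("RGB")
        encoding = self.processor(
            image,
            example["tokens"],
            boxes=example["bboxes"],
            word_labels=example["ner_tags"],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        # drop the extra batch dimension so the default collate_fn can stack samples
        # into a dict of tensors, which can then be unpacked with model(**batch)
        return {k: v.squeeze(0) for k, v in encoding.items()}
```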
You can refer to the Hugging Face documentation on writing a preprocessing script to prepare your dataset for LayoutLMv2. I used the same approach for my own custom dataset. Below is the FUNSD preprocessing file that I referenced: https://huggingface.co/datasets/nielsr/funsd/blob/main/funsd.py
You can check that out; it works for LayoutLMv2. Hope it helps.
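For example, once the dataset loads through such a script, encoding it for LayoutLMv2 usually looks roughly like the sketch below. It assumes the columns the FUNSD script produces (`words`, `bboxes`, `ner_tags`, `image_path`) and a processor built with `apply_ocr=False`, since the words and boxes already come from the annotations; adapt the column names for your own script.

```python
from datasets import load_dataset
from PIL import Image
from transformers import (LayoutLMv2FeatureExtractor, LayoutLMv2Processor,
                          LayoutLMv2TokenizerFast)

dataset = load_dataset("nielsr/funsd")  # or the path to your adapted loading script

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)


def encode(batch):
    # one image per example; the processor handles resizing and tokenization
    images = [Image.open(path).convert("RGB") for path in batch["image_path"]]
    return processor(
        images,
        batch["words"],
        boxes=batch["bboxes"],
        word_labels=batch["ner_tags"],
        padding="max_length",
        truncation=True,
    )


encoded = dataset.map(encode, batched=True, remove_columns=dataset["train"].column_names)
encoded.set_format("torch")
```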
@sheikhasim Thanks for your reply. However, https://huggingface.co/datasets/nielsr/funsd/blob/main/funsd.py downloads the FUNSD dataset and uses that, whereas my data is stored locally. How do I load it using Hugging Face's load_dataset, and what changes are required?
@KnitVikas If your dataset is already in the required input format, i.e. images + JSON, there are two ways to do it:

- Store a zip file of your dataset on Google Drive and create a preprocessing script on the Hugging Face Hub, pointing the download-and-extract step at the Drive dataset.zip. Once you run `load_dataset(name_of_hugging_face_script)`, it will load the dataset from Drive and preprocess it.
- Use Hugging Face's way of loading a dataset locally (a minimal sketch follows below): `from datasets import load_dataset` then `dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT', data_files='PATH/TO/MY/FILE')`. See https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html
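A minimal sketch of that second option; the paths below are placeholders for your own loading script (e.g. an adapted copy of funsd.py) and your own data archive:

```python
from datasets import load_dataset

# point load_dataset at a local copy of the (adapted) loading script,
# and pass your own files or archive through data_files
dataset = load_dataset(
    "path/to/my_loading_script.py",       # hypothetical local script
    data_files="path/to/my_dataset.zip",  # hypothetical local data
)
print(dataset)
```

Inside the script, the files passed via `data_files` should then be available as `self.config.data_files` (or, for remote URLs like a Drive link, downloaded via `dl_manager.download_and_extract` in `_split_generators`).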