donut
How many minimum images are required for training?
Hello @SamSamhuns, @gwkrsrch, @VictorAtPL. I have around 60 images and 8 custom tokens. Each image contains 3-4 of the same keys but with different values, and the annotation format is like SROIE. I followed this link, converted my data to the structure below, and followed the converter script as mentioned in the blog:
[
  {
    "Name": "Tom",
    "Buyer": "Conda",
    "contact_number": "989898989898",
    "alt_number": "55555555",
    "Buyer_id": "9856321023"
  },
  {
    "Name": "Hanks",
    "Buyer": "Conda",
    "contact_number": "99999999999",
    "alt_number": "25823102",
    "Buyer_id": "9856321024"
  },
  {
    "Name": "Lita",
    "Buyer": "Conda",
    "contact_number": "4545858402",
    "alt_number": "12121212121",
    "Buyer_id": "9856321022"
  }
]
My metadata.jsonl
{"file_name": "1.png", "ground_truth": "{\"gt_parse\": [{\"Name\": \"Tom\", \"Buyer\": \"Conda\", \"contact_number\": \"989898989898\", \"alt_number\": \"55555555\", \"Buyer_id\": \"9856321023\"}, {\"Name\": \"Hanks\", \"Buyer\": \"Conda\", \"contact_number\": \"99999999999\", \"alt_number\": \"25823102\", \"Buyer_id\": \"9856321024\"}, {\"Name\": \"Lita\", \"Buyer\": \"Conda\", \"contact_number\": \"4545858402\", \"alt_number\": \"12121212121\", \"Buyer_id\": \"9856321022\"}]}"}
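As a quick sanity check (a minimal sketch; the helper name is mine, not part of the Donut codebase), each metadata.jsonl line should be valid JSON whose ground_truth field is itself a JSON string that parses cleanly:

```python
import json

def check_metadata_line(line: str) -> dict:
    """Parse one metadata.jsonl line and return the decoded ground truth."""
    record = json.loads(line)
    # ground_truth is stored as a JSON-encoded string inside the record
    return json.loads(record["ground_truth"])

# A shortened version of the line above, for illustration
line = '{"file_name": "1.png", "ground_truth": "{\\"gt_parse\\": [{\\"Name\\": \\"Tom\\"}]}"}'
gt = check_metadata_line(line)
# Note: gt["gt_parse"] is a list here, which is exactly what trips
# the isinstance(..., dict) assert in donut/util.py later in this thread
print(type(gt["gt_parse"]))
```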
This is my config. My image sizes are variable: max (2205 x 1693), min (1755 x 779).
resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "/content/drive/MyDrive/results"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from model hub or path)
dataset_name_or_paths: ["/content/drive/MyDrive/my_VDU"] # loading datasets (from model hub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 300 # 800/8*30/10, 10%
num_training_samples_per_epoch: 800
max_epochs: 80
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 10
gradient_clip_val: 1.0
verbose: True
I have trained using this configuration for 300, 200, 120, 80, 40, and 20 epochs, but all the results were misspelled and the numbers were wrong. I don't know if I am doing something wrong, whether I should make some tweaks, or whether I should increase my training data. I even tried combining in the 200-image synthdog data, but no luck; the results were still misspelled.
60 might be too small a sample. I trained with around ~1200 images and it was better. If you have fewer images, train for fewer epochs and use a smaller learning rate than 3e-5. Your warmup_steps should also be higher, around 800, since they recommend 10% of the total steps.
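The 10% heuristic in the config comment above (800/8*30/10 = 300) can be sketched as follows (a minimal sketch; the function name and the batch-size/epoch arguments are illustrative, not from the Donut repo):

```python
def warmup_steps(num_samples: int, batch_size: int, epochs: int,
                 fraction: float = 0.1) -> int:
    """Warmup steps as a fraction (default 10%) of the total optimizer steps."""
    steps_per_epoch = num_samples // batch_size
    total_steps = steps_per_epoch * epochs
    return int(total_steps * fraction)

# Reproduces the comment in the config above: 800/8*30/10
print(warmup_steps(num_samples=800, batch_size=8, epochs=30))  # 300
```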
@SamSamhuns thanks for the feedback, but is my annotation format correct? I was getting an error on this assert in utils: assert "gt_parse" in ground_truth and isinstance(ground_truth["gt_parse"], dict)
Traceback (most recent call last):
File "train.py", line 152, in <module>
train(config)
File "train.py", line 88, in train
sort_json_key=config.sort_json_key,
File "/content/donut/donut/util.py", line 76, in __init__
assert "gt_parse" in ground_truth and isinstance(ground_truth["gt_parse"], dict)
AssertionError
This part would be helpful to you: https://github.com/clovaai/donut#data
If you have more than one ground truth per image, please use gt_parses, not gt_parse.
Hope this helps :)
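A minimal conversion sketch for that case (field names taken from the example annotations earlier in the thread; the record layout follows the metadata.jsonl shown above):

```python
import json

# Multiple annotations for a single image (from the example above)
annotations = [
    {"Name": "Tom", "Buyer": "Conda", "contact_number": "989898989898",
     "alt_number": "55555555", "Buyer_id": "9856321023"},
    {"Name": "Hanks", "Buyer": "Conda", "contact_number": "99999999999",
     "alt_number": "25823102", "Buyer_id": "9856321024"},
]

# More than one ground truth per image: wrap them in "gt_parses" (a list),
# not "gt_parse" (which the util.py assert requires to be a dict)
record = {
    "file_name": "1.png",
    "ground_truth": json.dumps({"gt_parses": annotations}),
}
print(json.dumps(record))
```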
@SamSamhuns is there any explanation for the warmup_steps formula? I have 4790 samples to train on.