
Dataset Annotation

Open tzktok opened this issue 1 year ago • 29 comments

I want to fine-tune the UniTable model on my custom dataset. How should I do the annotation, and is there any tool available for your annotation method? @matthewdhull @polochau @haekyu @helblazer811 @ShengYun-Peng

tzktok avatar May 25 '24 06:05 tzktok

Hi @tzktok, thanks for your interest! As stated in the paper, we used publicly available datasets while training UniTable. I will share the papers of these datasets below; their annotation processes may be helpful to you!

- PubTabNet: https://github.com/ibm-aur-nlp/PubTabNet
- SynthTabNet: https://arxiv.org/abs/2203.01017
- FinTabNet: https://developer.ibm.com/exchanges/data/all/fintabnet/

ShengYun-Peng avatar May 25 '24 19:05 ShengYun-Peng

> As stated in the paper, we used publicly available datasets while training UniTable. […]

I have used my own data to fine-tune the model, and the results have been very good. Thank you for your efforts. However, the inference speed does not meet my requirements. Are there any good methods to speed up inference? I have tried using TensorRT, but the improvement was not significant. Should I consider adding a KV cache to reduce the time spent on inference?

whalefa1I avatar May 26 '24 06:05 whalefa1I

Glad to know the finetuning went well! Yes, UniTable was implemented with a vanilla transformer architecture. A kv-cache like the one in the llama3 architecture here will largely speed up inference. Interested in opening a PR?

ShengYun-Peng avatar May 26 '24 14:05 ShengYun-Peng

> […] A kv-cache like the llama3 architecture here will largely speed up the inference. Interested in opening a PR?

I will try to add this part, and when all goes well I will submit the PR~

whalefa1I avatar May 26 '24 14:05 whalefa1I

Thanks! I would recommend starting by implementing the kv-cache logic in the pipeline notebook and comparing speed.
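The idea behind the kv-cache, independent of any particular codebase: during autoregressive decoding, the key/value rows of already-decoded tokens are stored and reused instead of being recomputed at every step. A toy single-head numpy sketch (not UniTable code; the projections are stand-ins) showing that the cached result matches full recomputation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # q: (1, d) for one new token; k, v: (t, d) for all tokens so far
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d, steps = 8, 5
xs = rng.normal(size=(steps, d))

# incremental decoding: each step appends one K/V row instead of
# recomputing keys/values for the whole prefix
k_cache, v_cache, cached_out = [], [], []
for t in range(steps):
    q = xs[t:t+1]          # stand-in for the query projection
    k_cache.append(xs[t])  # stand-in for the key projection
    v_cache.append(xs[t])  # stand-in for the value projection
    out = attend(q, np.stack(k_cache), np.stack(v_cache))
    cached_out.append(out[0])

# reference: full causal attention recomputed from scratch at the last step
full_out = attend(xs[-1:], xs, xs)
assert np.allclose(cached_out[-1], full_out[0])
```

The cache trades memory for compute: each decode step becomes O(t) attention over stored rows rather than re-running the projections over the whole prefix.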

ShengYun-Peng avatar May 26 '24 14:05 ShengYun-Peng

> As stated in the paper, we used publicly available datasets while training UniTable. […]

> I have used my own data to fine-tune the model, and the results have been very good. […]

How did you annotate your own dataset?

tzktok avatar May 27 '24 06:05 tzktok

I'm also interested in training on my own dataset but have no idea where to start with annotating it. Any advice? I originally tried the full_pipeline notebook, but it did not produce an accurate table from the image.

pincusz avatar May 29 '24 15:05 pincusz

I also want to train with a custom dataset. Could you please share your custom dataset preparation Python file?

lerndeep avatar May 30 '24 05:05 lerndeep

@whalefa1I Could you please provide the training script for UniTable large for the bbox, cell, and content modules?

lerndeep avatar May 30 '24 07:05 lerndeep

@whalefa1I May I ask how much data did you use to train in your scenario?

Sanster avatar May 31 '24 01:05 Sanster

@whalefa1I Could you please share the custom dataset preparation script?

lerndeep avatar Jun 03 '24 23:06 lerndeep

> @whalefa1I May I ask how much data did you use to train in your scenario?

30k maybe? Only the bbox model~

whalefa1I avatar Jun 04 '24 01:06 whalefa1I

> @whalefa1I Could you please provide the training script for UniTable large for the bbox, cell, and content modules?

It should work as long as you find the corresponding option in the CONFIG.mk file and configure it when running the Makefile with the experiment name [EXP_$*], right? Or do you want to convert it into a regular training script instead of using Hydra for configuration?

whalefa1I avatar Jun 04 '24 01:06 whalefa1I

> @whalefa1I could you please share the custom dataset preparation script?

Our data annotation format differs from the open-source TSR task annotation method, but both are composed of two coordinate points.

import json
from tqdm import tqdm

final_label_dataset = []
# each data_from_platform is a dict loaded from a Labelme-labeled JSON file
for data_from_platform in tqdm(data_from_platform_list):
    tmp_bbox_label = {}
    tmp_bbox_label['filename'] = data_from_platform["imagePath"]
    tmp_bbox_label['split'] = 'train'
    cells = []
    for sh in data_from_platform["shapes"]:
        label = sh["label"]
        points = sh["points"]
        # assumes 4 corner points per shape: points[0] is top-left, points[2] bottom-right
        bbox = [int(points[0][0]), int(points[0][1]), int(points[2][0]), int(points[2][1])]
        cells.append({"tokens": label, "bbox": bbox})
    tmp_bbox_label['cells'] = cells  # assign once, after the loop
    final_label_dataset.append(tmp_bbox_label)

with open('./train_data4unitable.json', 'w') as file:
    for data in final_label_dataset:
        file.write(json.dumps(data) + '\n')
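A quick way to sanity-check the resulting JSONL is to write a small sample, read each line back independently, and verify the bbox ordering. A minimal sketch (the record layout follows the script above; the sample filename is made up):

```python
import json
import os
import tempfile

records = [
    {"filename": "table_0001.png", "split": "train",
     "cells": [{"tokens": "Revenue", "bbox": [10, 12, 88, 30]}]},
]

path = os.path.join(tempfile.gettempdir(), "train_data4unitable_check.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# one JSON object per line: each line must parse on its own,
# and every bbox must satisfy x1 <= x2 and y1 <= y2
with open(path) as f:
    loaded = [json.loads(line) for line in f]
for rec in loaded:
    for cell in rec["cells"]:
        x1, y1, x2, y2 = cell["bbox"]
        assert x1 <= x2 and y1 <= y2
assert loaded == records
```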

whalefa1I avatar Jun 04 '24 01:06 whalefa1I

@whalefa1I

> Our data annotation format differs from the open-source TSR task annotation method, but both are composed of two coordinate points. […]
  1. Using this, you train for cell detection and content recognition, right?
  2. Did you do pretraining, or only fine-tuning?

In my case, tables have around 1000 cells, so I don't know whether fine-tuning with only an increased maxlen will work well.

lerndeep avatar Jun 04 '24 01:06 lerndeep

> Thanks! I would recommend starting from implementing the kv-cache logic in the pipeline notebook and compare speed.

It seems that, because the decoder has only 4 layers (or perhaps there is an error in my implementation), the acceleration is not significant: only about a 7% speedup (varying with the number of bboxes). Due to differences between the custom attention implementation and native torch attention (the MAE between the two is below 1e-8 in the first layer but grows to 0.9 after subsequent cross-attention), it may be necessary to retrain the model. Additionally, I have replaced components with the llama decoder. If you are interested, I can send it to you.

whalefa1I avatar Jun 04 '24 02:06 whalefa1I

> @whalefa1I could you please share the custom dataset preparation script?

> Our data annotation format differs from the open-source TSR task annotation method, but both are composed of two coordinate points. […]

Thank you for sharing. Have you trained the table structure part or not? If yes, how did you label the dataset in HTML format where colspan/rowspan are present?

lerndeep avatar Jun 04 '24 02:06 lerndeep

> In my case tables have around 1000 cells, so I don't know whether fine-tuning with only an increased maxlen will work well. […]

This is an interesting issue. I am currently using the llama decoder to reproduce the model, and its special positional encoding might have some capability for length-extension. However, for your case, I think it might be difficult. The out-of-distribution (OOD) phenomenon is likely to be significant, and you may need more data to support 4k token output.

whalefa1I avatar Jun 04 '24 02:06 whalefa1I

> Thank you for sharing. Have you trained the table structure part? If yes, how did you label the dataset in HTML format where colspan/rowspan are present? […]

This is related to our annotation format. We generate HTML tags from bbox annotations using a set of heuristic rules, so the entire process only requires a bbox model.
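The heuristic rules are not spelled out in the thread, but the core idea can be sketched: cluster cell bboxes into rows by vertical position, sort each row left to right, and emit flat HTML. This handles neither rowspan nor colspan, and `row_tol` is a hypothetical tolerance parameter, not something from the repo:

```python
def bboxes_to_html(cells, row_tol=10):
    """Very rough heuristic: cluster cells into rows by y-center, sort
    each row by x, and emit flat HTML (no rowspan/colspan handling)."""
    cells = sorted(cells, key=lambda c: (c["bbox"][1], c["bbox"][0]))
    rows = []
    for cell in cells:
        y_center = (cell["bbox"][1] + cell["bbox"][3]) / 2
        for row in rows:
            if abs(row["y"] - y_center) <= row_tol:
                row["cells"].append(cell)
                break
        else:
            rows.append({"y": y_center, "cells": [cell]})
    html = ["<table>"]
    for row in sorted(rows, key=lambda r: r["y"]):
        tds = "".join(
            f"<td>{c['tokens']}</td>"
            for c in sorted(row["cells"], key=lambda c: c["bbox"][0])
        )
        html.append(f"<tr>{tds}</tr>")
    html.append("</table>")
    return "".join(html)

cells = [
    {"tokens": "A", "bbox": [0, 0, 40, 20]},
    {"tokens": "B", "bbox": [50, 2, 90, 22]},
    {"tokens": "C", "bbox": [0, 40, 40, 60]},
]
print(bboxes_to_html(cells))
# <table><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>
```

A production version would also need column alignment to infer colspan/rowspan, which is where most of the business-specific logic tends to live.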

whalefa1I avatar Jun 04 '24 02:06 whalefa1I

> This is related to our annotation format. We generate HTML tags from bbox annotations using a set of heuristic rules, so the entire process only requires a bbox model. […]

Could you please let me know the process, or the code for the heuristic rules, to generate HTML from the Labelme JSON format?

It would be really helpful for me.

lerndeep avatar Jun 04 '24 02:06 lerndeep

> @whalefa1I May I ask how much data did you use to train in your scenario?

> 30k maybe? Only the bbox model~

Thank you for your reply, I would also like to ask you a question, in your scenario, what are the advantages of using unitable, which obtains bbox coordinates through autoregressive methods, compared to using object detection models (such as YOLO)?

BTW, I added a decoder with kv-cache in this PR https://github.com/poloclub/unitable/pull/11, which can achieve about a 30% improvement in inference speed with batch_size=1.


Sanster avatar Jun 04 '24 06:06 Sanster

> What are the advantages of using unitable, which obtains bbox coordinates through autoregressive methods, compared to using object detection models (such as YOLO)? BTW, I added a decoder with kv-cache in PR #11. […]

  1. Intuitively, direct object detection might not yield good results due to the presence of borderless tables and merged cells, so I have not trained a direct object detection model, though I am currently exploring related projects. This project inspired me to modify the data annotation format, thereby reducing model calls. I have also compared other open-source TSR models and believe that UniTable's pretraining transfers well to my own dataset.
  2. Thank you for your PR on the kv cache. May I ask whether you are able to achieve the same results as the original weights? I suspect there might be an issue with my implementation, as I have obtained outputs inconsistent with yours.

whalefa1I avatar Jun 04 '24 08:06 whalefa1I

> May I ask if you are able to achieve the same effects as the original weights? I suspect there might be an issue with my implementation, as I have obtained inconsistent outputs and results compared to yours. […]

I checked the results on the images in the dataset/mini_pubtabnet/val directory through full_pipeline.ipynb, and based on the visualization, the output is the same as the original model's.

Sanster avatar Jun 04 '24 09:06 Sanster

Hey @whalefa1I I'm wondering if you can assist.

I have a dataset that consists of PDFs with matching XML in SVG-tag format, derived from D3.js.

I have bbox and tokens for all the text, but since the images have to be resized, how do I ensure that the existing annotations will correspond with the downsampled images when fine-tuning?

Is the SVG tag structure useful? Would I need to add the SVG tags to the existing HTML vocab file?

Also, some tables overflow onto different pages. When converting with pdf2image, how can I keep box locations consistent between each image and the source PDF?

xuzmocode4-325 avatar Jul 24 '24 12:07 xuzmocode4-325

> I have a dataset that consists of PDFs with matching XML in SVG tag format. Is the SVG tag structure useful? Would I need to add the SVG tags to the existing HTML vocab file? […]

  1. Could you please share some samples of your dataset, so I can check whether they can be converted into the data format I use for training?
  2. Since I have not fine-tuned the HTML model or the content model, I don't know whether this will help, but a few months ago I tried adding the "border=1" tag (marking wired vs. borderless tables) to the HTML tags. This requires adding the tag to the vocab.json file, and it works, so if you want the HTML model to generate related tokens, you can consider adding SVG tags to the vocab file.
  3. Sorry, I converted PDF files into images and obtained the table area through a document layout analysis model. Cross-page tables were merged through specific business logic, so I did not consider table-merging logic for the general PDF scenario.
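On the resizing question above: annotations stay aligned with a downsampled image as long as the bbox coordinates are scaled by the same factors as the image. A minimal sketch (`scale_bbox` is a hypothetical helper, not part of the repo):

```python
def scale_bbox(bbox, src_size, dst_size):
    """Map [x1, y1, x2, y2] from the original image size (w, h) to the
    resized size, rounding to integer pixel coordinates."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x1, y1, x2, y2 = bbox
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]

# e.g. a cell annotated on a 1000x800 page, image downsampled to 500x400
print(scale_bbox([100, 200, 300, 400], (1000, 800), (500, 400)))
# [50, 100, 150, 200]
```

If the resize also pads the image to a square (as many vision pipelines do), the padding offsets would have to be added after scaling.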

whalefa1I avatar Jul 24 '24 15:07 whalefa1I

201124.pdf Sample Log SVG

Hey @whalefa1I

> 1. Could you please share some samples of your dataset so that I can see if they can be converted into the data format for my training?

Sure. I've shared a sample PDF with matching XML doc (SVG tag).

> 2. Since I have not finetuned the html model and content model, I don't know if this will help, but I tried to add the tag "border=1" of the wired/wireless table to the html tag in the early months. This requires adding the tag to the vocab.json file, and it works, so if you want the html model to generate related tokens, you can consider adding SVG tags to vocab file.

Thanks, will try this out.

xuzmocode4-325 avatar Jul 25 '24 11:07 xuzmocode4-325

> @whalefa1I May I ask how much data did you use to train in your scenario?

> 30k maybe? Only the bbox model~

Hi, have you trained the bbox model with your own dataset? Can you share the specific steps?

num3num avatar Jul 29 '24 09:07 num3num

> 201124.pdf Sample Log SVG
>
> Sure. I've shared a sample PDF with matching XML doc (SVG tag). […]

May I ask whether you trained on the two sample files you provided?

Wangzhongxi avatar Nov 06 '25 10:11 Wangzhongxi

This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and unable to reply to your email personally. I will reply as soon as possible after the holiday.

num3num avatar Nov 06 '25 10:11 num3num