TableTrainNet
Train on custom data
Hi @mawanda-jun
First of all, what an incredible project you have created here! I read the TrainNet research paper and it seems like a very cool idea.
I have a question though - I get mixed results on my own invoices (shipping industry). I was wondering, can I train one of the existing models on my own dataset?
For example, if I have annotated a lot of invoices in the format below:
filename class xmin ymin xmax ymax
my_invoice_page1.jpeg table 193 717 389 790
my_invoice_page2.jpeg table 220 940 362 997
Will I then be able to re-train one of the models and use it?
Hi, and thank you :)

First of all, you can definitely train the network on your own dataset; however, the performance of the model depends mainly on how many tables you have. Second, this is quite an old project, so I don't know whether the pre-trained models and the TF framework still work. Lastly, it is a proof of concept, so I must warn you that it is not "business ready".

If I were you, I'd take this project as a reference and try to build my own with updated libraries and new data. I'd definitely look at this dataset: https://github.com/doc-analysis/TableBank

I think the steps I followed are quite straightforward, but the code becomes obsolete quickly and I'm not maintaining it anymore.

Tell me if you have any other questions.

Have a nice day,
Giovanni
I tried using your project "IntelligentOCR" and got it up and running. It actually did a pretty good job; however, it clearly detects tables found in academic papers better than tables in invoices, for example, which is why I want to train it on my own dataset.
I have 2000 invoices - all containing tables, that I wish to train a new model on.
I was thinking something like this:
- Annotate the 2000 invoices according to the CSV format like above. (filename, class, xmin, ymin, xmax, ymax)
- Split the dataset into "training" and "test"
- Train the model
Yes, I think it would work just fine. Actually, you can also train the network on my original dataset first and then use the resulting model as a pre-trained starting point for training on your own dataset. You should definitely divide your dataset into training and test sets. Since you have 2K examples, I'd split it 70%-30% to have a good representation at test time.

Also consider changing the parameters of the blurring of the images, since I think the invoices have more "sparse" tables, am I right?
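The annotate-and-split plan discussed above could be sketched roughly as follows. This is only an illustration, not code from TableTrainNet: the function names `parse_annotations` and `split_by_file` are hypothetical, the annotation text is assumed to be whitespace-separated with a header row (so filenames must not contain spaces), and the split is done per file so that boxes from the same page never land in both sets.

```python
import random
from collections import defaultdict

# Sample in the "filename class xmin ymin xmax ymax" layout quoted in the thread.
SAMPLE = """\
filename class xmin ymin xmax ymax
my_invoice_page1.jpeg table 193 717 389 790
my_invoice_page2.jpeg table 220 940 362 997
"""


def parse_annotations(text):
    """Return {filename: [(class, xmin, ymin, xmax, ymax), ...]}."""
    boxes = defaultdict(list)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        filename, label, *coords = line.split()
        xmin, ymin, xmax, ymax = map(int, coords)
        if xmin >= xmax or ymin >= ymax:
            raise ValueError(f"degenerate box in {filename}: {coords}")
        boxes[filename].append((label, xmin, ymin, xmax, ymax))
    return dict(boxes)


def split_by_file(boxes, train_fraction=0.7, seed=0):
    """Shuffle filenames reproducibly and split them into train/test dicts."""
    filenames = sorted(boxes)
    random.Random(seed).shuffle(filenames)
    cut = round(len(filenames) * train_fraction)
    train = {name: boxes[name] for name in filenames[:cut]}
    test = {name: boxes[name] for name in filenames[cut:]}
    return train, test
```

With 2000 annotated files and the default `train_fraction=0.7`, this yields 1400 training files and 600 test files. Fixing the seed keeps the split stable between runs, which makes test results comparable.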
Thanks for your quick reply! Very much appreciated.
In regards to:
Consider to change also the parameters of the blurring of the images
What do you mean by this? Where do I find this blurring parameter, and why does it matter?
since I think the invoices has more "sparse" tables, am I right?
You most definitely are! These invoice tables don't have any clear lines separating columns or rows, but they are still presented as a "table-like/row-like" list.
Oh, I’m sorry. I thought I had implemented it, but actually I didn’t entirely. I am referring to this paper: https://www.researchgate.net/publication/320243569_Table_Detection_Using_Deep_Learning, in which they transformed the images so that a network pre-trained on normal images could adapt to sparse, b/w documents with tables.
I thought I had implemented it entirely, but I found only a b/w version of this transformation at this line: https://github.com/mawanda-jun/TableTrainNet/blob/6b3cee8ed0250d8cd52b374c76597a70121c398c/dataset/img_to_jpeg.py#L22
I think that I didn’t upload that change because it would have involved RGB images, which were far too heavy for my poor laptop.
However, to implement it, there are a few changes to make: you have to change that function, then find every place where the third dimension of the images is involved and change it from “1” to “3”. But there is some work to do, and maybe you’re not interested in doing it. :D
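To make the "1" to "3" change concrete, here is a minimal pure-Python sketch of turning a single-channel image into a three-channel one. It does not reproduce the function in `img_to_jpeg.py` or the transformation from the linked paper: `to_three_channels` and `invert` are hypothetical names, images are assumed to be nested lists of 0-255 integers, and the per-channel transforms are placeholders for whatever transformation you choose to apply.

```python
def identity(gray):
    """Placeholder transform: return the plane unchanged."""
    return gray


def invert(gray):
    """Placeholder transform: invert 8-bit intensities."""
    return [[255 - value for value in row] for row in gray]


def to_three_channels(gray, transforms=(identity, identity, identity)):
    """Build an H x W x 3 image from an H x W grayscale image.

    Each of the three transforms maps the grayscale plane to one output
    channel. With the default identities the gray plane is replicated.
    """
    if len(transforms) != 3:
        raise ValueError("exactly three per-channel transforms are required")
    planes = [transform(gray) for transform in transforms]
    height = len(gray)
    width = len(gray[0]) if height else 0
    return [
        [[planes[channel][y][x] for channel in range(3)] for x in range(width)]
        for y in range(height)
    ]
```

Any code downstream that assumes a last dimension of 1 (input shapes, decoding, saving to JPEG) would need the matching change, as noted above.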