[New Model] DocFormer: End-to-End Transformer for Document Understanding
🌟 New model addition
Model description
See "DocFormer: End-to-End Transformer for Document Understanding", Appalaraju et al (ICCV 2021) on CVF and arXiv
DocFormer is a multi-modal transformer model for 2D/visual documents from Amazon (where, fair disclosure, I also currently work, though not in research). At a high level I would characterize it as covering broadly the same use cases as LayoutLMv2 (already in transformers), but achieving better (state-of-the-art) results with smaller datasets, per the benchmarks in the paper.
I've found this kind of multi-modal, spatial/linguistic model very useful in the past (I actually released an AWS sample and blog post with Hugging Face LayoutLMv1 earlier this year) and would love for the improvements from DocFormer to be available through HF Transformers.
Open source status
- [X] the model implementation is available: (give details)
- Looks like there's an (MIT-0) implementation at https://github.com/shabie/docformer
- [ ] the model weights are available: (give details)
- Not currently as far as I can tell?
- [X] who are the authors: (mention them, if possible by @gh-username)
- Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha - all of AWS AI. Not sure of GitHub usernames
- @shabie for the currently available implementation
Haha thank you for this issue! Tagging @uakarsh since both of us have managed to get the architecture largely down (we think!)
It would be awesome to get this integrated with some help :)
Directly inspired by the journey of @NielsRogge
@shabie Thanks for the tag. @athewsey, as far as the weights are concerned, I have tried implementing their MLM task (described in the repo) as well as the Image Reconstruction part (for the unsupervised case), and based on the performance, I can say it is working nearly as well as reported in the paper. So, we are hoping to release it as soon as possible. I am quite excited to share the model with the community, since this is my first transformer implementation (along with @shabie), and nothing could be more exciting than this. However, there are some approximations in the model which may affect performance, but we will try to get the results as close as possible. Cheers,
Hi,
DocFormer would indeed be a great addition to the library. Note that pretrained weights are required for a model to be added.
Looking forward to this!
@NielsRogge Thank you for the quick reply!
It's very clear to us that weights are needed. That's the reason we haven't created this new model issue so far. That is not to say it wasn't a good idea, @athewsey!
So the two challenges in getting weights are compute and data.
Compute may be manageable, but the main problem right now is that OCR needs to be performed to extract words and their bounding boxes on the RVL-CDIP dataset. The thing is, pytesseract is ridiculously slow. I think pytesseract is just generally a poor implementation, given its disk-bound operations.
I didn't get the chance earlier, but I was about to ask you whether you have the dataset with the OCR step already completed and whether that could also be made available. That would speed things up a lot. If not, we'd have to overcome this hurdle first, which is where we're at basically. We'd need some kind of distributed computation (like a Spark cluster job) to complete this task in a manageable time.
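For context, a minimal sketch of the kind of OCR step being discussed, using pytesseract's `image_to_data` to pull words and their boxes from a page image, parallelised over a process pool; the file paths and output format here are placeholders, not the actual pre-processing pipeline:

```python
# Sketch only: extract (word, bounding box) pairs from document images with
# pytesseract, spreading pages across worker processes.
from multiprocessing import Pool

from PIL import Image
import pytesseract
from pytesseract import Output


def ocr_page(image_path):
    """Return a list of (word, (x0, y0, x1, y1)) tuples for one page image."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words = []
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():  # skip empty OCR cells
            words.append((text, (left, top, left + width, top + height)))
    return words


if __name__ == "__main__":
    image_paths = ["page_0001.png", "page_0002.png"]  # placeholder paths
    with Pool(processes=4) as pool:
        results = pool.map(ocr_page, image_paths)
```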
As an update, the authors will be sharing the Textract OCR for the RVL-CDIP dataset, and as soon as they release it, we will try to achieve the benchmark performance mentioned in the paper. However, we are also trying on our end to build our own OCR step and then perform pre-training and fine-tuning.
Any updates on this?
I have completed the scripts for pre-training on MLM and for using DocFormer for Document Image Classification. Check them out here: DocFormer Examples with PyTorch Lightning
Any updates on this? It would be very useful @uakarsh @shabie @athewsey @NielsRogge. LayoutLMv3 is cool, but its license doesn't allow commercial usage.
Hi @WaterKnight1998, we have been able to train the model; you can find it here.
The things done so far are:
- [x] Pre-training script for DocFormer on any dataset, either using Tesseract (i.e. no OCR provided) or with OCR supplied through any suitable tool
- [x] Training from scratch / fine-tuning DocFormer on any dataset; you can check out the link I mentioned above
- [ ] Reproducing the same results as the authors
Due to limited resources, I have currently only been able to complete the first two points, and I tried to show a demo of the same here. If @NielsRogge suggests, we can indeed integrate it with Hugging Face, since it would be easy to do so.
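For anyone following along, a rough sketch of a standard BERT-style MLM masking step (the 80/10/10 recipe) that pre-training scripts like this typically rely on; the mask token ID and vocabulary size are placeholders, and the actual DocFormer scripts linked above may differ in detail:

```python
# Sketch only: mask 15% of token positions for MLM, following the common
# 80% [MASK] / 10% random token / 10% unchanged split.
import torch


def mask_tokens(input_ids, mask_token_id=103, vocab_size=30522, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose positions to predict; everything else is ignored by the loss.
    masked_indices = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked_indices] = -100  # -100 is ignored by cross-entropy

    # 80% of the chosen positions become the mask token.
    replace_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replace_mask] = mask_token_id

    # Half of the remainder (10% overall) become a random token; the rest stay unchanged.
    random_mask = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replace_mask
    )
    input_ids[random_mask] = torch.randint(vocab_size, labels.shape, dtype=torch.long)[random_mask]
    return input_ids, labels
```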
Thanks,
@uakarsh I can help if you need help. Can this model be used for token classification?
Sure, with some modifications to the Document Image Classification script and the pre-processing, we would definitely be able to use it for token classification.
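To illustrate the kind of modification involved, a minimal sketch that swaps the single pooled classification head for a per-token head over the encoder's sequence output; `encoder`, `hidden_size` and `num_labels` are placeholders here, not the final integrated API:

```python
# Sketch only: token classification on top of a sequence-output encoder.
import torch.nn as nn


class DocFormerForTokenClassification(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_labels=7):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        # Classify every token position instead of one pooled vector per document.
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, **inputs):
        sequence_output = self.encoder(**inputs)  # (batch, seq_len, hidden_size)
        logits = self.classifier(self.dropout(sequence_output))
        return logits  # per-token label scores
```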
Hello there, @uakarsh. Has this initiative of integrating DocFormer into Transformers been discontinued in the meantime?
Hi @vprecup, thanks for your comment; it really made me happy that you are interested in integrating DocFormer into Hugging Face. However, the problem is that, as a student, I don't have enough compute to pre-train the model. As mentioned in the paper, the authors used 5M documents (pg. 6, above section 4.2) and have not specified the data. I believe the current IDL dataset would be sufficient as a pre-training dataset, and we have a demo notebook for pre-training.
So, maybe if somebody can do that, I can help them.
By the way, one interesting thing: in the DocFormer paper (pg. 7, Table 6), without pre-training the authors get an F1 score of 4.18 on FUNSD (100 epochs), while in our notebook we get 13.29 (a 3x improvement at 100 epochs), and it overfits, so maybe the implementation is good to go for your use case.
Thanks, Akarsh
Hi @uakarsh, if we could get you some compute power, would you like to give it a go?
It seems I can borrow a Z8 Fury workstation from HP, equipped with up to four of the latest NVIDIA RTX 6000 Ada generation GPUs, each boasting 48GB of VRAM. Additionally, it features Intel's most powerful CPU, potentially with up to 56 cores, and the option to be fully loaded with 2TB of RAM.
Creating the weights for the DocFormer should be a good use of this machine. What is your time availability?
Hi @mbertani, sorry for the late reply. If it is possible, I would certainly like to give it a go. As for my experience with GPUs, I have worked on a DGX workstation, and I believe the configuration you mentioned would work fine.
By time availability, do you mean having a meeting to discuss the plan further?
In the meantime, I will be working on arranging the code required for pre-training, as well as coming up with a plan for how to proceed. I do have some experience with pre-training (I have pre-trained LayoutLMv3 and some related models for a use case), so I can plan things and test them.
OK, good, then we can set up a meeting to discuss how we proceed. So as not to share emails on public forums, I can share my LinkedIn profile with you and we can take it from there?
https://www.linkedin.com/in/marcobertaniokland/
Sure
Any update on this?