DocBank How do you train with those NOT-TEXT elements.

Open linan142857 opened this issue 4 years ago • 4 comments

Dear author, some documents contain a massive number of non-text elements, such as hundreds of thousands of "##LTLine##" entries. How do you actually deal with them? For example, do you train on and predict all of those elements using the text '##LTLine##'?

Thank you!

Nov 09 '20 08:11 linan142857

Yes. We treat '##LTLine##' as a special token during training and prediction.

Nov 09 '20 08:11 liminghao1630

> Yes. We treat '##LTLine##' as a special token during training and prediction.

Hi! Could you please tell me the integer identifiers of the ##LTLine## and ##LTFigure## tokens in LayoutLM's vocabulary?

Thanks

Mar 26 '21 14:03 NandreyN

In fact, we did not add them to the vocabulary. Like any other text, they are split into tokens by the tokenizer and labeled in the way I mentioned in #25.
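As a rough illustration of what "tokenized into tokens" can mean for a placeholder that is not in the vocabulary, here is a minimal sketch of BERT-style tokenization (punctuation split followed by greedy WordPiece). It is not the DocBank code: the vocabulary `TOY_VOCAB` is hypothetical, the exact pieces produced by the real LayoutLM vocabulary may differ, and the labeling scheme from #25 is not reproduced here. The sketch simply assumes that every piece inherits the label of the element it came from.

```python
import string

# Hypothetical toy vocabulary; a real BERT/LayoutLM vocabulary is far larger.
TOY_VOCAB = {"[UNK]", "#", "lt", "##line", "##figure"}


def basic_split(text):
    """Lowercase the text and make every ASCII punctuation character its own piece."""
    pieces, current = [], ""
    for ch in text.lower():
        if ch in string.punctuation or ch.isspace():
            if current:
                pieces.append(current)
                current = ""
            if not ch.isspace():
                pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def wordpiece(word, vocab):
    """Greedy longest-match-first subword split; ['[UNK]'] if no split exists."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens


def tokenize_with_label(text, label, vocab=TOY_VOCAB):
    """Assumption: each subword piece inherits the label of its source element."""
    tokens = [t for piece in basic_split(text) for t in wordpiece(piece, vocab)]
    return [(t, label) for t in tokens]


# "table" is only an illustrative label for this element.
print(tokenize_with_label("##LTLine##", "table"))
```

With this toy vocabulary the placeholder has no integer identifier of its own; it is represented by the identifiers of its pieces.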

Apr 16 '21 06:04 liminghao1630

Thanks

Apr 16 '21 07:04 NandreyN
