
How to handle documents that contain more than 512 words?

Open persistforever opened this issue 5 years ago • 2 comments

For documents that contain more than 512 words, how do you split the data? I have two ideas.

For example, take a document that contains 5 words, ABCDE, and assume a window size of 2.

  1. It can be split into three independent documents: 'AB', 'CD' and 'E'. However, because these three documents are independent, no context is shared across the split boundaries, which may lower performance.
  2. It can be split into several overlapping documents via a sliding window. For example, with a window size of 3 words and 1 word of padding on each side, the document is split into five documents: 'AB', 'ABC', 'BCD', 'CDE' and 'DE' (the windows at the two ends are shorter). For 'BCD', B and D are the padding and C is the target word.
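
To make the two ideas concrete, here is a minimal Python sketch of both splits on the ABCDE example. The function names `split_fixed` and `split_sliding` and the `context` parameter are made up for illustration and are not part of DocBank.

```python
def split_fixed(words, size):
    """Idea 1: cut the document into independent, non-overlapping chunks."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def split_sliding(words, context):
    """Idea 2: one window per target word, with `context` words on each side.

    Windows at the document edges are shorter because there is nothing
    to the left of the first word or to the right of the last word.
    """
    return [
        words[max(0, i - context):i + context + 1]
        for i in range(len(words))
    ]


words = list("ABCDE")
print(["".join(chunk) for chunk in split_fixed(words, 2)])
# ['AB', 'CD', 'E']
print(["".join(window) for window in split_sliding(words, 1)])
# ['AB', 'ABC', 'BCD', 'CDE', 'DE']
```

Note the cost difference: the first idea yields about `len(words) / size` chunks, while the second yields one window per word.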

Do you use one of these methods, or a different one?

Thank you!

persistforever • Oct 13 '20 09:10

We use the first method and pad the last, incomplete sequence with padding tokens.
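
As a rough sketch of what that could look like (not the actual DocBank preprocessing code), the snippet below cuts a token list into fixed-length chunks and pads the last one. The name `chunk_and_pad`, the `"[PAD]"` token string, and the attention mask are assumptions for illustration; the default length of 512 comes from the question.

```python
def chunk_and_pad(tokens, max_len=512, pad_token="[PAD]"):
    """Split tokens into non-overlapping chunks of max_len and pad the last one.

    Returns a list of (chunk, mask) pairs, where mask is 1 for real tokens
    and 0 for padding. pad_token is a placeholder, not DocBank's actual value.
    """
    chunks = []
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        n_pad = max_len - len(chunk)  # non-zero only for the final chunk
        mask = [1] * len(chunk) + [0] * n_pad
        chunks.append((chunk + [pad_token] * n_pad, mask))
    return chunks


# The ABCDE example from the question, with a window size of 2:
for chunk, mask in chunk_and_pad(list("ABCDE"), max_len=2):
    print(chunk, mask)
# ['A', 'B'] [1, 1]
# ['C', 'D'] [1, 1]
# ['E', '[PAD]'] [1, 0]
```

The mask lets a model ignore the padded positions, which is the usual reason to track them alongside the tokens.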

liminghao1630 • Oct 15 '20 07:10

Ok, thanks a lot!

persistforever • Oct 20 '20 02:10