
group_texts function: Why?

Open HossamAmer12 opened this issue 1 year ago • 1 comment

There is a data function called group_texts. I understand that this function concatenates the texts and creates blocks of a specific block size. I would like to understand why you do this. Why not pad to the tokenizer's max length to get a rectangular tensor? Could you please explain why you opted for this approach?

HossamAmer12 · Oct 20 '24

Hi Hossam,

I don't think I implemented this function myself; I believe I copied it from Hugging Face's language modeling example.

If I remember correctly, it's simply more efficient than padding: you can pack more documents into the same batch, and padding is basically wasted compute.
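For reference, the grouping logic in Hugging Face's language modeling example looks roughly like this (the block_size value and column names are illustrative and may differ in this repo):

```python
from itertools import chain

block_size = 1024  # illustrative; the actual script takes this as an argument

def group_texts(examples):
    # Concatenate all tokenized texts in the batch into one long sequence per column.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every block has exactly block_size tokens.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split the long sequence into non-overlapping blocks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are a copy of the inputs; the shift happens inside the model.
    result["labels"] = result["input_ids"].copy()
    return result
```

Because every block is exactly block_size tokens, no padding tokens are needed and every position in the batch contributes to the loss.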

Another thing I think this function does, if I remember correctly, is build the sliding-window evaluation. This means that there are overlaps between documents, but every token is predicted only once and then serves as context in the next chunk.
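A minimal sketch of that sliding-window idea (not the exact code in this repo; the helper name, block_size, and stride are illustrative):

```python
import torch

def sliding_window_chunks(input_ids: torch.Tensor, block_size: int = 1024, stride: int = 512):
    # Hypothetical helper: consecutive windows overlap by (block_size - stride) tokens,
    # but each token is scored exactly once. Tokens already scored in an earlier window
    # get label -100, so they only serve as context here.
    chunks = []
    prev_end = 0
    for begin in range(0, input_ids.size(0), stride):
        end = min(begin + block_size, input_ids.size(0))
        ids = input_ids[begin:end]
        labels = ids.clone()
        labels[: prev_end - begin] = -100  # mask the overlapping context
        chunks.append((ids, labels))
        prev_end = end
        if end == input_ids.size(0):
            break
    return chunks
```

With label -100 the loss ignores the overlapping prefix, so perplexity is computed over each token exactly once while still giving it up to block_size tokens of left context.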

Best, Uri


urialon · Oct 20 '24