Pretrain with "sample-break-mode=complete_doc" and got "Assertion `srcIndex < srcSelectDimSize` failed."

Open Once2gain opened this issue 2 years ago • 0 comments

❓ Questions and Help

Before asking:

search the issues.
search the docs.

What is your question?

I followed examples/roberta/README.pretraining.md to continue pretraining on a task-specific dataset. I built dataset.raw and dataset.bpe, following the instruction that "each document should be separated by an empty line":

xxxxx... xxxxx...

xxxxx...

xxxxx...
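To make this layout concrete, here is a minimal sketch of how such a file could be split into documents and scanned for long lines. The helper names (split_documents, long_lines), the sample text, and the limit of 4 are made up for illustration and are not part of fairseq. It assumes tokens are separated by whitespace, as in a BPE-encoded file.

```python
def split_documents(text):
    """Split text into documents (lists of non-empty lines) at empty lines."""
    docs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the current document.
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs


def long_lines(docs, limit):
    """Return (doc_index, line_index, n_tokens) for lines with more than `limit` tokens."""
    return [
        (d, l, len(line.split()))
        for d, doc in enumerate(docs)
        for l, line in enumerate(doc)
        if len(line.split()) > limit
    ]


# Hypothetical sample: three documents, the second has one 6-token line.
sample = "1 2 3\n4 5\n\n6 7 8 9 10 11\n\n12\n"
docs = split_documents(sample)
flagged = long_lines(docs, limit=4)
```

On a real dataset, the limit would be whatever value is passed as the tokens-per-sample setting, and the text would come from reading dataset.bpe.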

When I set sample-break-mode=complete_doc (or complete) and start training, fairseq throws "Assertion `srcIndex < srcSelectDimSize` failed." followed by "RuntimeError: CUDA error: device-side assert triggered."

I read the source code of TokenBlockDataset and fairseq/data/token_block_utils_fast.pyx and realized that it doesn't check each line's size when breaking tokens into blocks of block_size.

Here are lines 67-69 of token_block_utils_fast.pyx:

while sz_idx < len(sizes_view):
    if curr_size + sizes_view[sz_idx] <= block_size or curr_size == 0:
        curr_size += sizes_view[sz_idx]

When curr_size == 0 and sizes_view[sz_idx] > block_size, all tokens of line sz_idx are still put into a single block, which then exceeds block_size. Is this intended, or did I miss something?
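To illustrate the behavior in question, here is a simplified pure-Python sketch of the greedy packing loop. Only the three lines of the loop quoted above come from the source. The else branch, the final flush, the function name pack_complete, and the example sizes are my assumptions about how the rest of the loop works, so treat this as a model rather than the actual fairseq implementation.

```python
def pack_complete(sizes, block_size):
    """Greedily pack whole lines into blocks; returns (start, end) token offsets."""
    blocks = []
    tok_idx = 0    # token offset where the current block starts
    curr_size = 0  # number of tokens in the current block
    sz_idx = 0     # index of the line being considered
    while sz_idx < len(sizes):
        # Same condition as the quoted code: an empty block accepts any line,
        # even one that is longer than block_size.
        if curr_size + sizes[sz_idx] <= block_size or curr_size == 0:
            curr_size += sizes[sz_idx]
            sz_idx += 1
        else:
            # Assumed: close the current block and retry the same line.
            blocks.append((tok_idx, tok_idx + curr_size))
            tok_idx += curr_size
            curr_size = 0
    if curr_size > 0:
        # Assumed: flush the last, partially filled block.
        blocks.append((tok_idx, tok_idx + curr_size))
    return blocks


# Hypothetical line sizes; the third line alone is longer than block_size.
sizes = [200, 300, 700, 100]
blocks = pack_complete(sizes, block_size=512)
block_lengths = [end - start for start, end in blocks]
oversized = [n for n in block_lengths if n > 512]
```

Under these assumptions the 700-token line ends up alone in one block of 700 tokens, so the block is longer than block_size. If the model's position limit matches block_size, such a block would plausibly index past the end of the positional embedding table, which would be consistent with the srcIndex assertion.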

What's your environment?

fairseq Version: 0.12.2
PyTorch Version: 2.0.1

May 28 '23 07:05 Once2gain
