fairseq
Pretraining with "sample-break-mode=complete_doc" fails with "Assertion `srcIndex < srcSelectDimSize` failed."
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
I followed examples/roberta/README.pretraining.md and wanted to continue pretraining on a task-specific dataset. I built dataset.raw and dataset.bpe, following the instruction that "each document should be separated by an empty line":
xxxxx... xxxxx...
xxxxx...
xxxxx...
When I set sample-break-mode to complete_doc or complete and start training, fairseq throws "Assertion `srcIndex < srcSelectDimSize` failed." and "RuntimeError: CUDA error: device-side assert triggered."
I read the source code of TokenBlockDataset and fairseq/data/token_block_utils_fast.pyx and noticed that it doesn't check the size of each individual line when breaking tokens into blocks of block_size.
Here are lines 67-69 of token_block_utils_fast.pyx:
    while sz_idx < len(sizes_view):
        if curr_size + sizes_view[sz_idx] <= block_size or curr_size == 0:
            curr_size += sizes_view[sz_idx]
When curr_size == 0 and sizes_view[sz_idx] > block_size, all tokens of line sz_idx are still put into a single block, so that block ends up longer than block_size. Is this the intended behavior, or am I missing something?
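To make the question concrete, here is a minimal pure-Python sketch of the greedy packing loop as I understand it. It is not the fairseq implementation: the names `pack_complete` and `oversized_lines` are hypothetical, the `else` branch is my assumption about what follows the quoted lines, and document boundaries (empty lines) are ignored. It only illustrates how a single line longer than block_size could end up as a block that exceeds block_size.

```python
def pack_complete(sizes, block_size):
    """Greedily pack consecutive lines into blocks of at most block_size tokens.

    Simplified sketch of the loop quoted above (assumed else-branch).
    Returns a list of (start_line, end_line_exclusive, total_tokens).
    """
    blocks = []
    sz_idx = 0      # index of the line currently being considered
    curr_size = 0   # tokens accumulated in the current block
    start = 0       # first line of the current block
    while sz_idx < len(sizes):
        # Same condition as the quoted code: a line is always accepted
        # when the block is empty, even if it alone exceeds block_size.
        if curr_size + sizes[sz_idx] <= block_size or curr_size == 0:
            curr_size += sizes[sz_idx]
            sz_idx += 1
        else:
            # Assumed behavior: close the current block and start a new one.
            blocks.append((start, sz_idx, curr_size))
            start = sz_idx
            curr_size = 0
    if curr_size > 0:
        blocks.append((start, sz_idx, curr_size))
    return blocks


def oversized_lines(sizes, block_size):
    """Return indices of lines whose token count alone exceeds block_size."""
    return [i for i, size in enumerate(sizes) if size > block_size]


# Hypothetical per-line token counts; line 2 is longer than block_size.
sizes = [200, 300, 700, 100]
block_size = 512
blocks = pack_complete(sizes, block_size)
too_long = oversized_lines(sizes, block_size)
print(blocks)
print(too_long)
```

In this sketch the line with 700 tokens becomes its own block of 700 tokens, which is larger than block_size. If the real code behaves the same way, such a block might be longer than the model's maximum number of positions, which would be one possible explanation for the index assertion, though I have not confirmed that.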
What's your environment?
- fairseq Version: 0.12.2
- PyTorch Version: 2.0.1