[Bug] concat_pad_data_collator's default pad_id of 0 may cause problems
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
In InternVL2-76B, the tokenizer maps the English character '!' to token id 0. If `concat_pad_data_collator`'s `pad_id` defaults to 0, the line `feat['attention_mask'] = feat['input_ids'].ne(pad_id)` will set the attention_mask to False at every position where '!' appears, even though it should be True.
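A minimal sketch of the collision (the token ids other than 0 are made up for illustration; only the '!' -> 0 mapping is from the report):

```python
import torch

# Suppose the tokenizer maps '!' -> 0 (as reported for InternVL2-76B),
# so a real token id and the pad value become indistinguishable.
# 101 and 102 are hypothetical ids standing in for ordinary tokens.
input_ids = torch.LongTensor([101, 0, 102, 0, 0])  # [tok, '!', tok, pad, pad]

pad_id = 0
attention_mask = input_ids.ne(pad_id)
print(attention_mask)  # tensor([ True, False,  True, False, False])
# The '!' at position 1 is masked out (False) even though it is a real token.
```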
Reproduction
```python
import torch

IGNORE_INDEX = -100  # as defined in the InternVL training code

def pad_data_collator(features, pad_id=0):
    first = features[0]
    batch = {}
    batch_lens = [feat['input_ids'].shape for feat in features]
    max_item_length = max(batch_lens)[0]
    for idx in range(len(features)):
        feat = features[idx]
        temp_input_ids = torch.LongTensor([pad_id] * max_item_length)
        temp_input_ids[:feat['input_ids'].shape[0]] = feat['input_ids']
        feat['input_ids'] = temp_input_ids
        temp_labels = torch.LongTensor([IGNORE_INDEX] * max_item_length)
        temp_labels[:feat['labels'].shape[0]] = feat['labels']
        feat['labels'] = temp_labels
        # Problematic line: any real token whose id equals pad_id (0)
        # is treated as padding and masked out.
        feat['attention_mask'] = feat['input_ids'].ne(pad_id)
```
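One possible workaround (a sketch, not the maintainers' fix): derive the attention mask from each sequence's original length before padding, so it no longer depends on whether `pad_id` collides with a real token id. `pad_data_collator_fixed` is a hypothetical name.

```python
import torch

IGNORE_INDEX = -100  # assumed value, matching the common HF convention

def pad_data_collator_fixed(features, pad_id=0):
    batch_lens = [feat['input_ids'].shape for feat in features]
    max_item_length = max(batch_lens)[0]
    for feat in features:
        seq_len = feat['input_ids'].shape[0]
        temp_input_ids = torch.LongTensor([pad_id] * max_item_length)
        temp_input_ids[:seq_len] = feat['input_ids']
        feat['input_ids'] = temp_input_ids
        temp_labels = torch.LongTensor([IGNORE_INDEX] * max_item_length)
        temp_labels[:seq_len] = feat['labels']
        feat['labels'] = temp_labels
        # Mask by original length, not by token value, so a real token
        # whose id happens to equal pad_id is still attended to.
        attention_mask = torch.zeros(max_item_length, dtype=torch.bool)
        attention_mask[:seq_len] = True
        feat['attention_mask'] = attention_mask
    return features
```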
Environment
Default
Error traceback