[Bug] The default pad_id of 0 in concat_pad_data_collator may be problematic

Open MrChen314 opened this issue 10 months ago • 0 comments

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

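In internvl2-76B, the tokenizer maps the English '!' to token ID 0. Because pad_id in concat_pad_data_collator defaults to 0, the line feat['attention_mask'] = feat['input_ids'].ne(pad_id) sets attention_mask to False at every position that holds '!', when it should be True.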
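
The collision can be illustrated with a minimal sketch that uses plain Python lists in place of tensors. Only the mapping of '!' to ID 0 comes from this report; the IDs 101 and 102 and the example sequence are made up for illustration.

```python
PAD_ID = 0  # default pad_id in the collator; also the ID of '!' per this report

# Hypothetical sequence ending in '!', followed by two padding positions.
# IDs 101 and 102 are placeholders, not real tokenizer output.
real_tokens = [101, 102, 0]          # the trailing 0 is a genuine '!'
padded = real_tokens + [PAD_ID] * 2  # right-padded to length 5

# List equivalent of input_ids.ne(pad_id): True wherever the ID differs from pad_id.
mask_from_ids = [tok != PAD_ID for tok in padded]

# Intended mask: True for every real token, False only for padding.
expected_mask = [True] * len(real_tokens) + [False] * 2

print(mask_from_ids)  # [True, True, False, False, False]
print(expected_mask)  # [True, True, True, False, False]
```

The two masks differ at index 2, the position of the real '!' token.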

Reproduction

def pad_data_collator(features, pad_id=0):

    first = features[0]
    batch = {}

    batch_lens = [feat['input_ids'].shape for feat in features]
    max_item_length = max(batch_lens)[0]
    for idx in range(len(features)):
        feat = features[idx]
        temp_input_ids = torch.LongTensor([pad_id] * max_item_length)
        temp_input_ids[:feat['input_ids'].shape[0]] = feat['input_ids']
        feat['input_ids'] = temp_input_ids
        temp_labels = torch.LongTensor([IGNORE_INDEX] * max_item_length)
        temp_labels[:feat['labels'].shape[0]] = feat['labels']
        feat['labels'] = temp_labels
        feat['attention_mask'] = feat['input_ids'].ne(pad_id)
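
One possible workaround, sketched here rather than taken from the repository, is to build the mask from each sequence's original length instead of comparing against pad_id. The function name pad_with_length_mask is hypothetical, and IGNORE_INDEX = -100 is an assumed value since the excerpt does not show its definition.

```python
import torch

IGNORE_INDEX = -100  # assumed value; the excerpt does not show its definition


def pad_with_length_mask(features, pad_id=0):
    """Hypothetical variant: pad each feature and mask by original length."""
    max_len = max(feat['input_ids'].shape[0] for feat in features)
    for feat in features:
        n = feat['input_ids'].shape[0]  # original length before padding

        input_ids = torch.full((max_len,), pad_id, dtype=torch.long)
        input_ids[:n] = feat['input_ids']
        feat['input_ids'] = input_ids

        labels = torch.full((max_len,), IGNORE_INDEX, dtype=torch.long)
        labels[:feat['labels'].shape[0]] = feat['labels']
        feat['labels'] = labels

        # The mask depends only on position, so a real token whose ID
        # equals pad_id (such as '!' -> 0) stays visible.
        mask = torch.zeros(max_len, dtype=torch.bool)
        mask[:n] = True
        feat['attention_mask'] = mask
    return features


# Toy batch with made-up IDs; the first sequence ends in a real token with ID 0.
feats = [
    {'input_ids': torch.tensor([5, 0]), 'labels': torch.tensor([5, 0])},
    {'input_ids': torch.tensor([7, 8, 9]), 'labels': torch.tensor([7, 8, 9])},
]
pad_with_length_mask(feats)
print(feats[0]['attention_mask'].tolist())  # [True, True, False]
```

With the original .ne(pad_id) approach, the same first sequence would get the mask [True, False, False]. Another option would be to pass the tokenizer's actual padding ID as pad_id, provided that ID is never used by a real token.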

Environment

Default

Error traceback

Feb 19 '25 11:02 MrChen314