InternVL
token_length calculation in LazySupervisedDataset
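The token_length calculation in LazySupervisedDataset looks wrong to me. When token_length is computed on the fly, the value computed the first time is just the length of input_ids, but the value saved in the conv2length dictionary is the input_ids length plus the image token length. So the first sample with a given string length appends the text-only length to self.length, while every later sample that hits the cache appends the length including image tokens.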
if self.group_by_length:
    self.conv2length = {}  # Using a dictionary to speed up token length calculation
    self.length = []
    for data_item in self.raw_data:
        data_item = json.loads(data_item)
        if 'length' in data_item:
            token_length = data_item['length']  # Use precomputed length if available
        else:
            # Compute token length using the tokenizer
            conversations = '\n'.join([temp['value'] for temp in data_item['conversations']])
            str_length = len(conversations)
            if str_length not in self.conv2length:
                token_length = tokenizer(
                    conversations, return_tensors='pt', padding=False, truncation=False,
                ).input_ids.size(1)
                self.conv2length[str_length] = token_length + num_image_token * (
                    max_dynamic_patch + use_thumbnail)
            else:
                token_length = self.conv2length[str_length]
        self.length.append(token_length)
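To make the mismatch concrete, here is a minimal standalone sketch that reproduces the same caching pattern without the real tokenizer. `fake_tokenize`, `lengths_as_reported`, and `lengths_consistent` are hypothetical names made up for this illustration, and the image token numbers are just example values. The second function shows one possible fix, assuming the intended length is the one that includes the image tokens; the maintainers may prefer something else.

```python
from typing import Dict, List


def fake_tokenize(text: str) -> int:
    """Hypothetical stand-in for the real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def lengths_as_reported(texts: List[str], image_tokens: int) -> List[int]:
    """Mirrors the logic quoted above: cache miss and cache hit yield different values."""
    cache: Dict[int, int] = {}
    lengths: List[int] = []
    for text in texts:
        key = len(text)  # the quoted code keys the cache by string length
        if key not in cache:
            token_length = fake_tokenize(text)          # text-only length
            cache[key] = token_length + image_tokens    # cached WITH image tokens
        else:
            token_length = cache[key]                   # read back WITH image tokens
        lengths.append(token_length)
    return lengths


def lengths_consistent(texts: List[str], image_tokens: int) -> List[int]:
    """Possible fix (an assumption): always append the cached value, so both paths agree."""
    cache: Dict[int, int] = {}
    lengths: List[int] = []
    for text in texts:
        key = len(text)
        if key not in cache:
            cache[key] = fake_tokenize(text) + image_tokens
        lengths.append(cache[key])
    return lengths


if __name__ == "__main__":
    texts = ["describe the image", "describe the image"]
    # Example values only: num_image_token * (max_dynamic_patch + use_thumbnail)
    image_tokens = 256 * (6 + 1)
    print(lengths_as_reported(texts, image_tokens))  # two identical samples, two different lengths
    print(lengths_consistent(texts, image_tokens))   # two identical samples, same length
```

With the reported logic, two identical samples end up with different entries in the length list, which would presumably skew the group_by_length bucketing.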