InternVL

token_length computation in LazySupervisedDataset

Open HalvesChen opened this issue 2 months ago • 0 comments

The token_length computation in LazySupervisedDataset looks inconsistent. When the length is computed on the fly, the token_length used for the first occurrence of a conversation is just the length of input_ids, whereas the value cached in the conv2length dictionary (and returned for later occurrences of the same string length) is the input_ids length plus the image-token count.

        if self.group_by_length:
            self.conv2length = {}  # Using a dictionary to speed up token length calculation
            self.length = []
            for data_item in self.raw_data:
                data_item = json.loads(data_item)
                if 'length' in data_item:
                    token_length = data_item['length']  # Use precomputed length if available
                else:
                    # Compute token length using the tokenizer
                    conversations = '\n'.join([temp['value'] for temp in data_item['conversations']])
                    str_length = len(conversations)
                    if str_length not in self.conv2length:
                        token_length = tokenizer(
                            conversations, return_tensors='pt', padding=False, truncation=False,
                        ).input_ids.size(1)
                        self.conv2length[str_length] = token_length + num_image_token * (
                                    max_dynamic_patch + use_thumbnail)
                    else:
                        token_length = self.conv2length[str_length]
                self.length.append(token_length)
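
For reference, a minimal sketch of one way to make the two branches agree, assuming the cached value (input_ids length plus the image-token term) is the intended one; num_image_token, max_dynamic_patch and use_thumbnail are taken from the snippet above:

                    if str_length not in self.conv2length:
                        token_length = tokenizer(
                            conversations, return_tensors='pt', padding=False, truncation=False,
                        ).input_ids.size(1)
                        # Add the image tokens before caching *and* before use,
                        # so the first occurrence matches later cache hits.
                        token_length += num_image_token * (max_dynamic_patch + use_thumbnail)
                        self.conv2length[str_length] = token_length
                    else:
                        token_length = self.conv2length[str_length]
                self.length.append(token_length)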

HalvesChen · Oct 10 '25 02:10