input_mask and number_indices don't match because of [cls] at the beginning

Open KnightZhang625 opened this issue 3 years ago • 0 comments

def convert_single_mathqa_example(example, is_training, tokenizer, max_seq_length,
                                  max_program_length, op_list, op_list_size,
                                  const_list, const_list_size,
                                  cls_token, sep_token):
    """Converts a single MathQAExample into an InputFeature."""
    features = []
    question_tokens = example.question_tokens
    if len(question_tokens) > max_seq_length - 2:
        print("too long")
        question_tokens = question_tokens[:max_seq_length - 2]
    tokens = [cls_token] + question_tokens + [sep_token]         # 1. This line adds [cls_token] at the beginning.
    segment_ids = [0] * len(tokens)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    input_mask = [1] * len(input_ids)
    for ind, offset in enumerate(example.number_indices):          # 2. Why aren't the number_indices offset by 1?
        if offset < len(input_mask):
            input_mask[offset] = 2
        else:
            if is_training == True:

                # invalid example, drop for training
                return features

            # assert is_training == False

Hello, thanks for the great work! I am confused by one part of the code, though.

At comment 1, [cls_token] is prepended to the tokens, so the index of every question token in tokens shifts right by 1. At comment 2, example.number_indices is used directly to set the number positions in input_mask to 2. This is confusing, because input_mask is built from tokens, which already contains [cls] at the beginning.

For example, take tokens = [[cls], a, b, 1, c, d]. example.number_indices will be [2], because there is no [cls] at the beginning when example.number_indices is calculated, so the 2 is the index of the number "1" in [a, b, 1, c, d]. The initial input_mask is [1, 1, 1, 1, 1, 1]. After assigning 2 at the positions in example.number_indices, input_mask becomes [1, 1, 2, 1, 1, 1], but index 2 in tokens refers to "b", not to the number "1".

Could you please explain this? Thanks!
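
A minimal sketch that reproduces the suspected mismatch with the toy tokens from the example. It assumes that number_indices is computed on the question tokens before [cls] is prepended; that is my reading, not something confirmed from the repository. The helper build_input_mask and its offset parameter are hypothetical names used only for illustration, and [sep] is left out to match the example.

```python
def build_input_mask(num_tokens, number_indices, offset=0):
    """Return a mask of 1s with a 2 at each number position.

    `offset` is a hypothetical parameter: 0 mimics the quoted code,
    1 shifts every index to account for a prepended [cls] token.
    """
    mask = [1] * num_tokens
    for idx in number_indices:
        pos = idx + offset
        if pos < num_tokens:
            mask[pos] = 2
    return mask


question_tokens = ["a", "b", "1", "c", "d"]
# Assumption: indices were computed on question_tokens, without [cls].
number_indices = [2]
tokens = ["[cls]"] + question_tokens

unshifted = build_input_mask(len(tokens), number_indices)
shifted = build_input_mask(len(tokens), number_indices, offset=1)

# Collect the tokens that each mask flags as numbers.
marked_unshifted = [t for t, m in zip(tokens, unshifted) if m == 2]
marked_shifted = [t for t, m in zip(tokens, shifted) if m == 2]

print(unshifted, marked_unshifted)
print(shifted, marked_shifted)
```

Under that assumption, the unshifted mask flags "b" and only the shifted mask flags "1". If number_indices is in fact computed with the [cls] position already counted, the quoted code would be correct as written and no shift would be needed.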

KnightZhang625, Feb 15 '22 03:02