
Truncating Longest Sequence mechanism and evidence/question enquiry

Open laurauzcategui opened this issue 2 years ago • 0 comments

Hi @jind11 ,

I hope you are great :) I'm working on a project for my class XCS224U - Natural Language Understanding @ Stanford, and I'm trying to use MedQA fine-tuning mechanisms as one of my baselines.

Currently, I have processed the MedQA question-answer pairs and prepared the context retrieved by IR. When creating the batches of sequences for fine-tuning BERT, I tried to truncate only the context, as you mention in the paper, but I got an error indicating this could not be done.

I was wondering whether you tokenized and truncated each pair of sentences separately and then concatenated them to get the input_ids, or whether you used the tokenizers provided by HuggingFace.
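For clarity, this is roughly what I mean by the first option (just a sketch on my side, assuming a standard [CLS] context [SEP] question+option [SEP] layout; the variable names, placeholder strings, and checkpoint below are mine, not from your code):

```python
from transformers import BertTokenizer

# Sketch of "tokenize separately, truncate only the context, then concatenate".
# `context` is the IR-retrieved evidence and `qa_text` is the question plus one
# answer option; these placeholders and names are my own.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_length = 512

context = "IR-retrieved evidence text ..."
qa_text = "Question text ... candidate answer text"

context_ids = bert_tokenizer.encode(context, add_special_tokens=False)
qa_ids = bert_tokenizer.encode(qa_text, add_special_tokens=False)

# Reserve room for [CLS] and two [SEP] tokens, then cut only the context.
context_budget = max_length - 3 - len(qa_ids)
context_ids = context_ids[:context_budget]

input_ids = (
    [bert_tokenizer.cls_token_id]
    + context_ids
    + [bert_tokenizer.sep_token_id]
    + qa_ids
    + [bert_tokenizer.sep_token_id]
)
token_type_ids = [0] * (len(context_ids) + 2) + [1] * (len(qa_ids) + 1)
attention_mask = [1] * len(input_ids)
```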

Currently, I'm using BertTokenizer from HuggingFace as follows:

```python
tokenized_examples = bert_tokenizer(first_sentences, second_sentences, truncation='only_first', max_length=512)
```
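For context, the surrounding setup looks roughly like this on my side (a sketch only; the way I build first_sentences and second_sentences from the IR context and the question + option text is my own choice, and the checkpoint name is just a placeholder):

```python
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# One (context, question + option) pair per answer option; names and values are mine.
question = "Example question text ..."
options = ["option A", "option B", "option C", "option D"]
ir_context = "IR-retrieved evidence for this question ..."

first_sentences = [ir_context] * len(options)
second_sentences = [question + " " + opt for opt in options]

tokenized_examples = bert_tokenizer(
    first_sentences,
    second_sentences,
    truncation="only_first",  # truncate only the context (first sequence)
    max_length=512,
)
```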

I tried `only_first` for the truncation parameter, given what is mentioned in the paper:

> We truncate the longest sequence to 512 tokens after sentence-piece tokenization (we only truncate context

Another question I have: when you pass the context and $qa_{i}$ pair to the reader model, did you use the raw question text for the question $q$, or the keywords extracted by the MetaMap tool?

Thanks :)

laurauzcategui • Oct 17 '22 14:10