MedQA
Truncating Longest Sequence mechanism and evidence/question enquiry
Hi @jind11,
I hope you are great :) I'm working on a project for my class XCS224U - Natural Language Understanding @ Stanford, and I'm trying to use the MedQA fine-tuning setup as one of my baselines.
Currently, I have processed and read the MedQA question-answer pairs and prepared the context retrieved by IR. When creating the batch of sequences for fine-tuning with BERT, I tried to truncate only the context, as you mention in the paper, but got an error saying it couldn't be done.
I was wondering whether you tokenized and truncated each pair of sentences separately and concatenated them to get the input_ids, or whether you used the tokenizers provided by HuggingFace.
Currently, I'm using BertTokenizer from HuggingFace as follows:
tokenized_examples = bert_tokenizer(first_sentences, second_sentences, truncation='only_first', max_length=512)
I tried using only_first for the truncation parameter given what is mentioned in the paper:
"We truncate the longest sequence to 512 tokens after sentence-piece tokenization (we only truncate context)."
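In case it clarifies my first question, this is roughly the "tokenize separately, truncate only the context, then concatenate" alternative I was considering. It's just a sketch: the helper name and the bert-base-uncased checkpoint are placeholders, and I'm assuming the context is the first sequence of the pair.

```python
from transformers import BertTokenizerFast

# Hypothetical helper: truncate only the context (first sequence) so that
# the (context, question+answer) pair fits within max_length tokens.
def encode_pair_truncate_context_only(tokenizer, context, question_answer, max_length=512):
    # Tokenize the question+answer part without special tokens to see how
    # much room is left once [CLS] ... [SEP] ... [SEP] are added.
    qa_ids = tokenizer.encode(question_answer, add_special_tokens=False)
    num_special = tokenizer.num_special_tokens_to_add(pair=True)  # 3 for BERT
    context_budget = max(max_length - len(qa_ids) - num_special, 0)

    # Tokenize the context and keep only as many tokens as fit in the budget.
    context_ids = tokenizer.encode(context, add_special_tokens=False)
    context_ids = context_ids[:context_budget]

    # Let the tokenizer add special tokens, token_type_ids and attention_mask.
    return tokenizer.prepare_for_model(context_ids, qa_ids)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
example = encode_pair_truncate_context_only(
    tokenizer,
    context="<retrieved evidence passage>",
    question_answer="<question plus candidate answer>",
)
print(len(example["input_ids"]))  # <= 512 as long as the QA part itself fits
```

Is this close to what you did, or did you rely on the HuggingFace pair tokenization end to end?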
Another question I have: when you pass the context and $qa_{i}$ pair to the reader model, did you use the raw question text for the question q, or did you use the words extracted by the MetaMap tool?
Thanks :)