natural-questions
Convert token to text
Is there a way to convert the output (currently in the form of tokens) of the model to text for easy interpretation and testing?
For example, the annotator marks the long answer using byte offsets, token offsets, and an index into the list of long answer candidates: "long_answer": { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0 }. How do I map these byte and token offsets to the text containing the answer?
You might want to try something like this:
import jsonlines

INPUT_FILE = "nq-train-sample.jsonl"
START_TOKEN = 3521
END_TOKEN = 3525
QAS_ID = 4549465242785278785
REMOVE_HTML = True


def get_span_from_token_offsets(f, start_token, end_token, qas_id,
                                remove_html):
    for obj in f:
        # Skip examples until the requested one is found.
        if obj["example_id"] != qas_id:
            continue
        # The slice is half-open: the token at end_token is not included.
        if remove_html:
            # Drop HTML markup tokens such as <P> or </Td>.
            answer_span = [
                item["token"]
                for item in obj["document_tokens"][start_token:end_token]
                if not item["html_token"]
            ]
        else:
            answer_span = [
                item["token"]
                for item in obj["document_tokens"][start_token:end_token]
            ]
        return " ".join(answer_span)
    # Falls through and returns None if qas_id is not in the file.


with jsonlines.open(INPUT_FILE) as f:
    result = get_span_from_token_offsets(f, START_TOKEN, END_TOKEN, QAS_ID,
                                         REMOVE_HTML)
print(result)
Output: March 18 , 2018
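The question also asks about byte offsets. A minimal sketch for those, assuming start_byte and end_byte index into the UTF-8 encoded bytes of the example's document_html field (worth checking against your copy of the data). The helper name get_span_from_byte_offsets and the tiny HTML string are made up for illustration:

```python
def get_span_from_byte_offsets(document_html, start_byte, end_byte):
    """Return the HTML text covered by the half-open range [start_byte, end_byte)."""
    # Assumption: offsets count bytes of the UTF-8 encoded HTML, not characters,
    # so the string is encoded before slicing and decoded afterwards.
    html_bytes = document_html.encode("utf-8")
    return html_bytes[start_byte:end_byte].decode("utf-8", errors="replace")


# Made-up document (not real NQ data). The "é" takes two bytes in UTF-8,
# so byte offsets and character offsets differ after it.
html = "<p>Café opened on March 18, 2018</p>"
start = len("<p>Café opened on ".encode("utf-8"))
end = start + len("March 18, 2018".encode("utf-8"))
print(get_span_from_byte_offsets(html, start, end))
```

Unlike the token version, the byte span keeps any HTML tags that fall inside the range, so you may still want to strip markup afterwards.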
You can read your prediction file to get the start_token, end_token, and example_id values, then call the function for each prediction to build a list of predicted spans (and write them to a file, or whatever you need).
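A rough sketch of that batch step, with some assumptions: the prediction file is taken to be JSON shaped like {"predictions": [{"example_id": ..., "long_answer": {"start_token": ..., "end_token": ...}}]}, and a negative start_token is treated as "no answer". The helper names tokens_to_text and spans_for_predictions are made up here. Instead of rescanning the data file once per prediction, it indexes the predictions by example_id and makes a single pass:

```python
import json


def tokens_to_text(document_tokens, start_token, end_token, remove_html=True):
    """Join the tokens in [start_token, end_token), optionally dropping HTML tokens."""
    span = document_tokens[start_token:end_token]
    return " ".join(
        t["token"] for t in span if not (remove_html and t["html_token"])
    )


def spans_for_predictions(examples, predictions, remove_html=True):
    """Return {example_id: long-answer text} in one pass over the examples."""
    # Assumed prediction layout; adjust the keys if your file differs.
    by_id = {p["example_id"]: p["long_answer"] for p in predictions}
    results = {}
    for obj in examples:
        answer = by_id.get(obj["example_id"])
        # Skip examples with no prediction or an (assumed) null answer.
        if answer is None or answer["start_token"] < 0:
            continue
        results[obj["example_id"]] = tokens_to_text(
            obj["document_tokens"],
            answer["start_token"],
            answer["end_token"],
            remove_html,
        )
    return results


# Tiny in-memory stand-ins for the data file and the prediction file.
examples = [{
    "example_id": 1,
    "document_tokens": [
        {"token": "<P>", "html_token": True},
        {"token": "March", "html_token": False},
        {"token": "18", "html_token": False},
        {"token": ",", "html_token": False},
        {"token": "2018", "html_token": False},
        {"token": "</P>", "html_token": True},
    ],
}]
predictions = json.loads(
    '{"predictions": [{"example_id": 1,'
    ' "long_answer": {"start_token": 0, "end_token": 6}}]}'
)["predictions"]

print(spans_for_predictions(examples, predictions))
```

With real files, examples can be the jsonlines reader from the snippet above and predictions can come from json.load on your prediction file.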
Hope this helps!