Add start and end char indexes to QnA pipeline
Summary of change:
QuestionAnsweringPipeline now returns the start and end character indexes of the answer within the context.
Reason for request:
- This allows the end user to see where in the context document the answer is sourced from. The Python version of Transformers already returns these indexes for its question-answering pipeline.
- This feature was marked as a TODO by Xenova.
This feature was discussed in this issue: https://github.com/xenova/transformers.js/issues/312
Thanks for making this PR! :)
However, there are some differences in the start/end values for certain inputs where characters are stripped. For example:
Python (original)

```python
from transformers import pipeline

oracle = pipeline('question-answering', "distilbert-base-cased-distilled-squad")
question = "Where do I live?"
context = "My name is Wolfgang \n\n\n\n\n\n\n\n and I live in Berlin"
out = oracle(question, context)
# {'score': 0.9925103187561035, 'start': 43, 'end': 49, 'answer': 'Berlin'}
```
JavaScript (your PR)

```js
import { pipeline } from "@xenova/transformers";

let oracle = await pipeline('question-answering', "Xenova/distilbert-base-cased-distilled-squad");
let question = "Where do I live?";
let context = "My name is Wolfgang \n\n\n\n\n\n\n\n and I live in Berlin";
let out = await oracle(question, context);
// { answer: 'Berlin', score: 0.9941156970411846, start: 34, end: 40 }
```
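For reference, the 9-character shift matches the collapsed whitespace run exactly, which suggests the offsets are currently being computed over a whitespace-normalized copy of the context (quick illustration only, not pipeline code):

```js
// Illustration only: collapsing the whitespace run accounts exactly for the
// 43 vs. 34 difference shown above.
const context = "My name is Wolfgang \n\n\n\n\n\n\n\n and I live in Berlin";
const collapsed = context.replace(/\s+/g, " ");
console.log(context.indexOf("Berlin"));   // 43 — matches the Python output
console.log(collapsed.indexOf("Berlin")); // 34 — matches the current PR output
```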
As stated in the docs:
- start (int) — The character start index of the answer (in the tokenized version of the input).
- end (int) — The character end index of the answer (in the tokenized version of the input).
Thanks for the feedback!
I've made a new commit that handles inputs containing escaped whitespace (e.g. `\n`). The new method compares the tokenized text from the model output with the original input text to find the stripped whitespace and add it back into the character counts.
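In rough terms, the idea is something like the following simplified sketch (illustrative only, not the committed code; `mapToOriginalIndex` is a made-up name):

```js
// Simplified sketch (not the actual committed code): map a character offset
// computed over a whitespace-normalized string back to the original string.
function mapToOriginalIndex(original, normalized, normalizedIndex) {
  const isSpace = (ch) => /\s/.test(ch);
  let o = 0; // cursor in the original string
  let n = 0; // cursor in the normalized string
  while (n < normalizedIndex && o < original.length) {
    if (isSpace(original[o])) {
      // A whitespace run in the original maps to at most one whitespace
      // character in the normalized text.
      while (o < original.length && isSpace(original[o])) o++;
      if (n < normalizedIndex && isSpace(normalized[n])) n++;
    } else {
      // Non-whitespace characters are assumed to line up one-to-one.
      o++;
      n++;
    }
  }
  // If the answer starts immediately after a stripped whitespace run,
  // skip that run in the original as well.
  while (o < original.length && isSpace(original[o]) && !isSpace(normalized[n] ?? "x")) o++;
  return o;
}

const context = "My name is Wolfgang \n\n\n\n\n\n\n\n and I live in Berlin";
const normalized = context.replace(/\s+/g, " ");
console.log(mapToOriginalIndex(context, normalized, normalized.indexOf("Berlin"))); // 43
```

The sketch assumes that only whitespace is ever removed or collapsed, and that all other characters line up one-to-one between the two strings, which is the case discussed above.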
A few comments:
- It's a change on the edge: an advantage of this approach is that it is relatively non-intrusive, confined to one area of additional code. But it feels more like a shim than an elegant solution, compared with making deeper changes at the core.
- Backslashes: when testing, I get the result in the comment above, and other tests produce the expected results, including "extra" spaces. The only exception is when the input contains literal backslashes (e.g. "the path is c:\\downloads"): the first backslash is not counted. It probably shouldn't be, since that is how JavaScript handles strings; the backslash escapes the character that follows and is hidden from other JS string methods. So while this is probably correct from a system perspective, in pure text terms it feels wrong. Counting it, however, would likely introduce further problems.
- Hugging Face's demo page for Question Answering produces different results: https://huggingface.co/tasks/question-answering. It could be that the demo page passes the string in raw form (e.g. String.raw), so the characters are not escaped (see the snippet after this list). This doesn't feel right to me, but I wanted to raise it.
- This may just be me confusing my nomenclature, but when I read this line of the documentation, "The character start index of the answer (in the tokenized version of the input)", I interpret "tokenized version" as meaning the text would not include a space for `\n`, because in the tokenized version it has been stripped. However, the code comment in `question_answering.py` (line 571) states, "# Start: Index of the first character of the answer in the context string." To me, that means escaped whitespace characters should be counted, which is what this commit currently does. That feels right to me (and is consistent with the comment above), but I wanted to point out that I see a discrepancy between the code comment and the documentation. Again, that could just be my own nomenclature confusion.
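To make the backslash / raw-string point concrete (illustration only; the path string is just an example):

```js
// In a normal string literal, "\\" is a single backslash character, so the
// written source contains one more character than the string it produces.
console.log("the path is c:\\downloads".length);            // 24 (one '\' character)
// String.raw keeps the backslashes exactly as written.
console.log(String.raw`the path is c:\\downloads`.length);  // 25 (two '\' characters)
```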
@alex-breen would this be possible for the other pipelines as well, e.g. token-classification?
@ldenoue I don't think this code can be reused directly for other pipelines like token-classification, but the approach of deriving character offsets from token indexes could probably be applied in a similar fashion (see the rough sketch below).
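For what it's worth, here is a rough, hypothetical sketch of that idea for token-classification (not part of this PR; it assumes the decoded tokens preserve the original casing and use WordPiece-style `##` subword markers):

```js
// Hypothetical helper (not part of this PR): given decoded token strings in
// order and the original text, find each token's character span by scanning
// forward through the original text.
function tokenCharSpans(tokens, text) {
  const spans = [];
  let cursor = 0;
  for (const token of tokens) {
    // Assumption: WordPiece-style '##' subword markers; other tokenizers
    // use different conventions.
    const piece = token.replace(/^##/, "");
    const start = text.indexOf(piece, cursor);
    if (start === -1) {
      // e.g. special tokens such as [CLS]/[SEP] that don't appear in the text
      spans.push(null);
      continue;
    }
    spans.push({ start, end: start + piece.length });
    cursor = start + piece.length;
  }
  return spans;
}

console.log(tokenCharSpans(["My", "name", "is", "Wolf", "##gang"], "My name is Wolfgang"));
// [ { start: 0, end: 2 }, { start: 3, end: 7 }, { start: 8, end: 10 },
//   { start: 11, end: 15 }, { start: 15, end: 19 } ]
```

If the tokenizer already exposes offset mappings, using those directly would of course be preferable to re-scanning the text.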
@xenova is there anything else you'd like from me for this PR?