azure-search-openai-demo
Citations aren't consistently returned in the expected format
Using the latest code, with no changes other than replacing the data files, I asked the same question: "can you find the CoA file D0865021125133"
Here is the error:
Here is the data file that has the issue:
D0865021125133_CoA.pdf
I found that the LLM returns the file link sometimes at the front (D0865021125133_CoA.pdf#page=1) and sometimes at the end (Source: D0865021125133_CoA.pdf#page=1).
The position shouldn't matter; what matters is the enclosing square brackets, like [D0865021125133_CoA.pdf#page=1]. The citation-extracting code is in AnswerParser.tsx, so you can modify it to be looser if the LLM isn't adhering strictly, but that risks your code extracting things that aren't actually citations. What model are you using? I assume you didn't change the part of the prompt that explains the square-bracket formatting?
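A loosened extraction along these lines could look like the following TypeScript sketch. This is illustrative only, not the actual AnswerParser.tsx code: extractCitations and its file-name heuristic are assumptions, made up here to show the idea of matching any bracketed span and then filtering to things that look like file citations.

```typescript
// Hypothetical sketch: extract "[file.pdf#page=N]" style citations from an
// answer, tolerating an optional "Source:" / "Sources:" prefix inside the
// brackets. Not the repo's real implementation.
function extractCitations(answer: string): string[] {
    const citations: string[] = [];
    // Match any bracketed span.
    const bracketRegex = /\[([^\]]+)\]/g;
    let match: RegExpExecArray | null;
    while ((match = bracketRegex.exec(answer)) !== null) {
        // Strip an optional "Source(s):" prefix inside the brackets.
        const inner = match[1].replace(/^Sources?:\s*/i, "").trim();
        // Accept only spans that end like "file.ext#page=N", so ordinary
        // bracketed prose is not misread as a citation.
        if (/\.\w+#page=\d+$/.test(inner) && !citations.includes(inner)) {
            citations.push(inner);
        }
    }
    return citations;
}
```

The trailing `#page=N` check is the guard against the "extracting things that aren't actually citations" risk mentioned above.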
Thanks for replying. I'm using gpt-4.1-mini; here is the returned parsedAnswer:
The certificate of analysis for Ghost Dye™ Red 780 Lot: D0865021125133 includes the following details:
- Cat. No.: 13-0865-T100
- Format: Ghost Dye™
- Concentration: 1 µL/test
- Volume: 0.11 mL
- Size: 100 tests
- Storage: ≤ -20℃
- Formulation: 1 µL/test in DMSO
- QC testing: Performance confirmed by flow cytometry
- Expiration date: 31-Oct-2027
- This product lot has passed Cytek Bioscience's Quality Control (QC) Tests and is certified for Research Use Only.
- Not for use in diagnostic or therapeutic procedures. Not for resale.
- Cytek Biosciences will not be held responsible for patent infringement or other violations that may occur with the use of this product.
Contact information: Cytek Biosciences Inc. 10840 Thornmint Road, San Diego, CA 92127 Phone: +1 (510) 657-0102 Fax: +1 (510) 657-0151 Email: [email protected] Website: www.cytekbio.com
[Sources: D0865021125133_CoA.pdf#page=1][D2.pdf#page=1][814522D1_CoA.pdf#page=1]
It has the enclosing square brackets, "[]", but also has "Sources:", and somehow it won't pick that up as a citation. So in code I'm looking for the "Sources" string:

    // 2) Detect trailing "Source: ..." / "Sources: ..." references
    const sourceRegex = /Sources?:\s*(.+?)\.?$/i;
    const sourceMatch = parsedAnswer.match(sourceRegex);
    if (sourceMatch && sourceMatch[1]) {
        // Remove the "Source: ..." text from the final displayed answer
        parsedAnswer = parsedAnswer.substring(0, sourceMatch.index).trim();

        // Split multiple references by commas or 'and'
        const possibleCitations = sourceMatch[1]
            .split(/\band\b|,/i)
            .map(str => str.trim())
            .filter(Boolean);

        // Add each reference to the citations array if valid
        possibleCitations.forEach(ref => {
            // Optional: skip isCitationValid if you want all "Source:" references included
            if (!citations.includes(ref) && isCitationValid(contextDataPoints, ref)) {
                citations.push(ref);
            }
        });
    }
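One snag with a "Sources:" regex on the answer above is that the trailer is a chain of bracketed references, "[Sources: ...][...][...]", so the captured text still contains "][" separators. A hypothetical helper (splitSourceTrailer is a made-up name, not code from the repo) could split on those boundaries as well as commas and 'and':

```typescript
// Hypothetical helper: split a chained "[Sources: a.pdf#page=1][b.pdf#page=1]"
// trailer into individual references. Illustrative only.
function splitSourceTrailer(trailer: string): string[] {
    return trailer
        .replace(/^\[|\]$/g, "")           // strip the outermost brackets
        .split(/\]\s*\[|,|\band\b/i)       // split on "][", commas, or "and"
        .map(s => s.replace(/^Sources?:\s*/i, "").trim())
        .filter(Boolean);
}
```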
I am currently in a similar boat, where the LLM returns citations like [random_document.pdf#pages=1-2]. This only happens when citations are added as a subscript to a passage. I will say this happens very rarely; I've only run into it a handful of times.
The prompt instructions mention the square-bracket formatting as well. I've added additional instructions which explicitly tell the LLM not to combine these sources, but I still occasionally run into it.
Below is an example:
@Daimler-Garay I've just added the actual allowed citations in the most recent version of the prompts: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty#L36
Have you tried that approach? (The frontend can then check citations against a citations list, so that it's not relying just on square bracket regex to verify what's a citation).
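The allowed-list check described above could be sketched roughly like this. isAllowedCitation is a hypothetical name (the repo has its own isCitationValid whose logic may differ), and comparing on the file part only is an assumption made for illustration:

```typescript
// Hypothetical sketch: accept a candidate citation only if its file appears
// in the list of citations the backend actually supplied to the prompt.
// Not the repo's actual validation code.
function isAllowedCitation(allowed: string[], ref: string): boolean {
    // Compare on the file part only, ignoring the "#page=N" fragment,
    // in case the allowed list is stored without page anchors.
    const file = ref.split("#")[0];
    return allowed.some(c => c.split("#")[0] === file);
}
```

This way the frontend does not have to trust the bracket regex alone: anything not in the allowed list is dropped even if it is bracketed.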
Thanks for the response! I'll add that change and see if I still run into the issue.
I did run into a different issue today, which you can encounter when you have a higher top=N: the model suggested citations with page ranges:
[Northwind_Standard_Benefits_Details.pdf#pages=36-37,42-43][Northwind_Health_Plus_Benefits_Details.pdf#pages=38-39].
I need to add instructions to the prompt to avoid that, since those are harder to turn into clickable links.
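Alternatively to prompt instructions, a range citation could be expanded on the frontend into individual clickable #page=N links. This is a hypothetical sketch (expandPageRanges is not part of the repo) assuming ranges look like "pages=36-37,42-43":

```typescript
// Hypothetical sketch: expand "file.pdf#pages=36-37,42-43" into one
// "file.pdf#page=N" citation per page. Illustrative only.
function expandPageRanges(citation: string): string[] {
    const m = citation.match(/^(.+)#pages?=(.+)$/);
    if (!m) return [citation]; // already a single-page citation
    const [, file, spec] = m;
    const pages: number[] = [];
    for (const part of spec.split(",")) {
        const [start, end] = part.split("-").map(Number);
        // "42" (no dash) yields end === undefined, so fall back to start.
        for (let p = start; p <= (end ?? start); p++) {
            pages.push(p);
        }
    }
    return pages.map(p => `${file}#page=${p}`);
}
```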
I've seen this issue as well, even though I haven't changed the top=N value. Weird. I also still ran into the subscript issue despite adding the additional instruction (Possible citations for current question: {% for citation in citations %} [{{ citation }}] {% endfor %}).