Wrong link highlight position when searching for a word
In some PDFs, the highlight is not placed correctly, possibly because of layout issues or some other reason.
The highlight has the correct size for the text, but its position is slightly off.
For example, in the image, the first two words are highlighted correctly, but the ones near the bottom of the image are misplaced.
Please check this.
@espresso3389
Originally posted by @dineshnaikb in https://github.com/espresso3389/pdfrx/issues/189#issuecomment-2247131510
@dineshnaikb There seems to be some kind of miscalculation of the text index.
From the screenshot, I'm guessing that for the 3rd one, indexOf("where") on "Countries where" returns 5 instead of 10. The 4th one returns 3.
The text index is calculated cumulatively, so if some characters increment the index incorrectly, every position after them is shifted, which results in this kind of problem.
If possible, could you give me access to the PDF file (or at least the problematic page) so that I can reproduce the issue?
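To make the cumulative index drift concrete, here is a minimal sketch in plain Dart. It does not use pdfrx or PDFium; `Slot`, `dropped`, `extractedText`, and `sampleSlots` are hypothetical names invented for this illustration. It assumes that some characters on the page are missing from the extracted string, which is only a guess at this point in the thread.

```dart
/// One character slot on a page, in the order the PDF engine reports it.
/// `dropped` marks a slot that the extracted string does not contain
/// (a hypothetical flag for this sketch).
class Slot {
  const Slot(this.char, {this.dropped = false});
  final String char;
  final bool dropped;
}

/// Builds the string a search would run on: dropped slots are simply missing.
String extractedText(List<Slot> slots) =>
    slots.where((s) => !s.dropped).map((s) => s.char).join();

/// Sample page: "ab", two dropped space-like slots, then "where".
List<Slot> sampleSlots() => [
      const Slot('a'),
      const Slot('b'),
      const Slot(' ', dropped: true),
      const Slot(' ', dropped: true),
      for (final c in 'where'.split('')) Slot(c),
    ];

void main() {
  final slots = sampleSlots();
  final text = extractedText(slots);

  // Index of the match inside the extracted string.
  final stringIndex = text.indexOf('where');
  // Index of the slot that really holds the 'w' on the page.
  final slotIndex = slots.indexWhere((s) => s.char == 'w');

  // Prints "string index: 2, slot index: 4".
  print('string index: $stringIndex, slot index: $slotIndex');
  // Using stringIndex to look up character rectangles would cover
  // slots 2..6 instead of slots 4..8, i.e. two characters too early.
}
```

Every dropped slot before a match shifts the highlight by one more character, which would explain why matches further down the page are further off.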
I just visualized the text handling error:
with the following code:
import 'package:flutter/material.dart';
import 'package:pdfrx/pdfrx.dart';

/// Debug helper: draws every extracted character and its bounding box
/// on top of the page so that index/rectangle mismatches become visible.
class PdfTextSearcherExt extends PdfTextSearcher {
  PdfTextSearcherExt(super.controller);

  @override
  void pageTextMatchPaintCallback(Canvas canvas, Rect pageRect, PdfPage page) {
    super.pageTextMatchPaintCallback(canvas, pageRect, page);
    final text = getText(page.pageNumber);
    if (text == null) return; // Text not loaded yet; repainted once it is.
    for (final f in text.fragments) {
      if (f.charRects == null) continue;
      for (int i = 0; i < f.charRects!.length; i++) {
        final rect = f.charRects![i].toRectInPageRect(page: page, pageRect: pageRect);
        // Small red label showing the character assigned to this rectangle.
        final tf = TextPainter(
          text: TextSpan(
            text: f.text[i],
            style: const TextStyle(color: Colors.red, fontSize: 4),
          ),
          textDirection: TextDirection.ltr,
        );
        tf.layout();
        tf.paint(canvas, rect.bottomLeft + Offset((rect.width - tf.width) / 2, 0));
        // Outline the character rectangle, extended to include the label.
        canvas.drawRect(
            Rect.fromLTRB(rect.left, rect.top, rect.right, rect.bottom + tf.height),
            Paint()
              ..color = Colors.red.withAlpha(127)
              ..style = PaintingStyle.stroke);
      }
    }
  }

  /// Returns the cached page text, or null while it is loaded in the background.
  PdfPageText? getText(int pageNumber) {
    final cached = getCachedTextIfAvailable(pageNumber: pageNumber);
    if (cached != null) {
      return cached;
    }
    _loadTextAsync(pageNumber);
    return null;
  }

  /// Loads the page text and triggers a repaint when it becomes available.
  Future<void> _loadTextAsync(int pageNumber) async {
    final text = await loadText(pageNumber: pageNumber);
    if (text != null) {
      notifyListeners();
    }
  }
}
Looking closer, we can see two space-like characters that are skipped during text processing.
The cause of the issue is the design of the FPDFText_GetText function:
// Function: FPDFText_GetText
//          Extract unicode text string from the page.
// Parameters:
//          text_page   -   Handle to a text page information structure.
//                          Returned by FPDFText_LoadPage function.
//          start_index -   Index for the start characters.
//          count       -   Number of UCS-2 values to be extracted.
//          result      -   A buffer (allocated by application) receiving the
//                          extracted UCS-2 values. The buffer must be able to
//                          hold `count` UCS-2 values plus a terminator.
// Return Value:
//          Number of characters written into the result buffer, including the
//          trailing terminator.
// Comments:
//          This function ignores characters without UCS-2 representations.
//          It considers all characters on the page, even those that are not
//          visible when the page has a cropbox. To filter out the characters
//          outside of the cropbox, use FPDF_GetPageBoundingBox() and
//          FPDFText_GetCharBox().
//
FPDF_EXPORT int FPDF_CALLCONV FPDFText_GetText(FPDF_TEXTPAGE text_page,
                                               int start_index,
                                               int count,
                                               unsigned short* result);
It states that the function ignores characters without UCS-2 representations, and that is what causes the text layout on the PDF to be mishandled.
So the workaround is easy: call FPDFText_GetUnicode for every character. The concern is that FFI function calls can be costly, so extracting text this way may be slower than a single FPDFText_GetText call.
With the workaround above, we can now get the text correctly.
According to tests on Windows and WASM, the speed is good enough, so we can use the workaround.
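The per-character workaround described above can be sketched as follows. This is a simplified illustration, not the actual pdfrx change: `TextPage`, `FakeTextPage`, and `extractPerChar` are hypothetical names, and the two interface methods only stand in for FFI calls to FPDFText_CountChars and FPDFText_GetUnicode. Substituting a space for characters with no usable code is likewise an assumption of this sketch.

```dart
/// Hypothetical stand-in for a loaded text page. In a real binding these
/// two methods would forward to FPDFText_CountChars and FPDFText_GetUnicode.
abstract class TextPage {
  int countChars();
  int getUnicode(int index);
}

/// In-memory fake so the sketch runs without PDFium.
class FakeTextPage implements TextPage {
  FakeTextPage(this.codes);
  final List<int> codes;

  @override
  int countChars() => codes.length;

  @override
  int getUnicode(int index) => codes[index];
}

/// Builds the page text one character at a time, emitting exactly one
/// UTF-16 code unit per character index so that string indices and
/// character indices stay aligned.
String extractPerChar(TextPage page) {
  const placeholder = 0x20; // Assumption: a space stands in for unusable codes.
  final count = page.countChars();
  final sb = StringBuffer();
  for (var i = 0; i < count; i++) {
    final code = page.getUnicode(i);
    // Code 0 has no text representation, and a code above 0xFFFF would need
    // two code units; this sketch replaces both with the placeholder.
    final usable = code != 0 && code <= 0xFFFF;
    sb.writeCharCode(usable ? code : placeholder);
  }
  return sb.toString();
}

void main() {
  // 'a', an unrepresentable character, then 'b'.
  final page = FakeTextPage([0x61, 0, 0x62]);
  final text = extractPerChar(page);
  print('${text.length} ${text.indexOf('b')}'); // Prints "3 2".
}
```

Because the output always has as many code units as the page has characters, an index returned by a string search can be used directly to look up the matching character rectangle.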
pdfrx 1.1.24 contains the fix.
"Character count mismatch between FPDFText_CountChars() and FPDFText_GetText()." https://issues.chromium.org/issues/42270558