
Wrong link highlight position on searching a word

Open espresso3389 opened this issue 1 year ago • 2 comments

In some PDFs, the search highlight is not placed correctly, due to layout issues or some other reason. The size of the highlight matches the text, but its position is slightly off. For example, in the image the first two words are highlighted correctly, but the ones at the bottom of the image are misplaced. Please check this, @espresso3389.

Originally posted by @dineshnaikb in https://github.com/espresso3389/pdfrx/issues/189#issuecomment-2247131510

espresso3389 avatar Jul 24 '24 08:07 espresso3389

@dineshnaikb There seems to be some kind of miscalculation of the text index.

From the screenshot, I'm guessing that for the 3rd one, indexOf("where") on "Countries where" returns 5 instead of 10. The 4th one returns 3.

The calculation of the text index is cumulative, so some characters in the text advance the index incorrectly, and every match after them ends up shifted.
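To illustrate how such a drift accumulates, here is a minimal, self-contained Dart sketch. It does not use pdfrx or PDFium: `extractSkipping`, `indexDrift`, and the `\u0001` placeholder for a "problem" character are hypothetical names chosen for illustration. It only simulates an extractor that drops some characters while the per-character positions still count them.

```dart
/// Simulates a text extractor that silently drops some characters.
/// Hypothetical helper for illustration; not part of pdfrx or PDFium.
String extractSkipping(
  List<String> pageChars,
  bool Function(String char) isSkipped,
) {
  return pageChars.where((c) => !isSkipped(c)).join();
}

/// Returns how far the index of [word] in [extracted] lags behind its
/// true character index on the page (one entry per character).
int indexDrift(List<String> pageChars, String extracted, String word) {
  final trueIndex = pageChars.join().indexOf(word);
  final foundIndex = extracted.indexOf(word);
  return trueIndex - foundIndex;
}
```

With a page such as `'a\u0001b \u0001word'.split('')` and `\u0001` treated as skipped, a word after the first dropped character is off by one position and a word after the second is off by two, so a highlight computed from the extracted string lands on rectangles further and further to the left.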

If possible, could you give me access to the PDF file (or at least the problematic page) so that I can reproduce the issue?

espresso3389 avatar Jul 24 '24 08:07 espresso3389

I just visualized the text handling error:

Image

with the following code:


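import 'package:flutter/material.dart';
import 'package:pdfrx/pdfrx.dart';

/// Debug helper: outlines every character rectangle on the page and draws the
/// character assigned to it underneath, so index mismatches become visible.
class PdfTextSearcherExt extends PdfTextSearcher {
  PdfTextSearcherExt(super.controller);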

  @override
  void pageTextMatchPaintCallback(Canvas canvas, Rect pageRect, PdfPage page) {
    super.pageTextMatchPaintCallback(canvas, pageRect, page);

    final text = getText(page.pageNumber);
    if (text == null) return;

    for (final f in text.fragments) {
      if (f.charRects == null) continue;
      for (int i = 0; i < f.charRects!.length; i++) {
        final rect = f.charRects![i].toRectInPageRect(page: page, pageRect: pageRect);
        final tf = TextPainter(
          text: TextSpan(
            text: f.text[i],
            style: const TextStyle(color: Colors.red, fontSize: 4),
          ),
          textDirection: TextDirection.ltr,
        );
        tf.layout();
        tf.paint(canvas, rect.bottomLeft + Offset((rect.width - tf.width) / 2, 0));
        canvas.drawRect(
            Rect.fromLTRB(rect.left, rect.top, rect.right, rect.bottom + tf.height),
            Paint()
              ..color = Colors.red.withAlpha(127)
              ..style = PaintingStyle.stroke);
      }
    }
  }

  PdfPageText? getText(int pageNumber) {
    final cached = getCachedTextIfAvailable(pageNumber: pageNumber);
    if (cached != null) {
      return cached;
    }
    _loadTextAsync(pageNumber);
    return null;
  }

  Future<void> _loadTextAsync(int pageNumber) async {
    final text = await loadText(pageNumber: pageNumber);
    if (text != null) {
      notifyListeners();
    }
  }
}

espresso3389 avatar Apr 23 '25 09:04 espresso3389

Looking closer, we can see two space-like characters that are skipped during text processing.

Image

espresso3389 avatar Apr 23 '25 11:04 espresso3389

The cause of the issue is the design of the FPDFText_GetText function:

// Function: FPDFText_GetText
//          Extract unicode text string from the page.
// Parameters:
//          text_page   -   Handle to a text page information structure.
//                          Returned by FPDFText_LoadPage function.
//          start_index -   Index for the start characters.
//          count       -   Number of UCS-2 values to be extracted.
//          result      -   A buffer (allocated by application) receiving the
//                          extracted UCS-2 values. The buffer must be able to
//                          hold `count` UCS-2 values plus a terminator.
// Return Value:
//          Number of characters written into the result buffer, including the
//          trailing terminator.
// Comments:
//          This function ignores characters without UCS-2 representations.
//          It considers all characters on the page, even those that are not
//          visible when the page has a cropbox. To filter out the characters
//          outside of the cropbox, use FPDF_GetPageBoundingBox() and
//          FPDFText_GetCharBox().
//
FPDF_EXPORT int FPDF_CALLCONV FPDFText_GetText(FPDF_TEXTPAGE text_page,
                                               int start_index,
                                               int count,
                                               unsigned short* result);

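It states that the function ignores characters without UCS-2 representations. The returned string can therefore be shorter than the page's character count, so indices into it no longer line up with the per-character rectangles, and the text layout of the PDF is mishandled.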

espresso3389 avatar Apr 23 '25 16:04 espresso3389

So the workaround is easy: we should call FPDFText_GetUnicode for every character. The problem is that FFI function calls are sometimes costly, so extracting text this way may be slower than with FPDFText_GetText.
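As a rough sketch of the per-character approach (not necessarily what pdfrx itself does), the Dart function below builds the page text from one lookup per character index. To stay self-contained it takes the lookup as a callback: `unicodeAt` is a hypothetical stand-in for an FFI binding to FPDFText_GetUnicode, and `charCount` for the value of FPDFText_CountChars. Substituting U+FFFD for characters that have no usable single UTF-16 code unit is an assumption made here so that string indices stay equal to character indices.

```dart
/// Builds the page text with exactly one UTF-16 code unit per character
/// index, so that `text[i]` always corresponds to character rectangle `i`.
///
/// [unicodeAt] is a hypothetical callback standing in for an FFI call to
/// FPDFText_GetUnicode; [charCount] stands in for FPDFText_CountChars.
String extractTextPerChar(int charCount, int Function(int index) unicodeAt) {
  final buffer = StringBuffer();
  for (var i = 0; i < charCount; i++) {
    final codePoint = unicodeAt(i);
    // Assumption: 0 means "no Unicode value", and values above 0xFFFF would
    // need two code units; both are replaced by U+FFFD to keep alignment.
    final fitsOneUnit = codePoint > 0 && codePoint <= 0xFFFF;
    buffer.writeCharCode(fitsOneUnit ? codePoint : 0xFFFD);
  }
  return buffer.toString();
}
```

Because every index yields exactly one code unit, the resulting string has length `charCount`, and an `indexOf` result can be used directly as an index into the character rectangles. The cost is one lookup per character instead of one bulk call.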

espresso3389 avatar Apr 23 '25 16:04 espresso3389

And, with the workaround above, we can now correctly get the text.

Image

espresso3389 avatar Apr 23 '25 16:04 espresso3389

According to the tests on Windows and WASM, the speed is good enough and we can use the workaround.

espresso3389 avatar Apr 23 '25 16:04 espresso3389

pdfrx 1.1.24 contains the fix.

espresso3389 avatar Apr 28 '25 17:04 espresso3389

"Character count mismatch between FPDFText_CountChars() and FPDFText_GetText()." https://issues.chromium.org/issues/42270558

espresso3389 avatar Nov 03 '25 16:11 espresso3389