go-pdfium

Fix incorrect text extraction after special characters

Open xuges opened this issue 9 months ago • 4 comments

The FPDFText_CountChars function returns the number of text characters on the page, and the FPDFText_GetCharBox function returns the bounding box of a text character on the page; its index argument is a character index. The FPDFText_GetText function, however, returns a substring of the page text, and its index argument is a UTF-16 code unit index! So if a character needs two (or more) UTF-16 code units, the text of every character after it no longer matches its character box. I ran into this case and found a solution: use the FPDFText_GetUnicode function instead of FPDFText_GetText.

The FPDFText_GetUnicode function returns the Unicode value of a text character, and its index argument is a character index too.

Finally, I suspect that pdfium always assumes UTF-16 takes only 2 bytes per character :(

xuges avatar Feb 27 '25 07:02 xuges

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 75.54%. Comparing base (3516e50) to head (bad8a27). Report is 39 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
+ Coverage   75.51%   75.54%   +0.02%     
==========================================
  Files         110      110              
  Lines       25318    25319       +1     
==========================================
+ Hits        19119    19126       +7     
+ Misses       4365     4361       -4     
+ Partials     1834     1832       -2     

codecov[bot] avatar Feb 27 '25 07:02 codecov[bot]

Hi! We have noticed issues with this before, see some of the discussion in: https://groups.google.com/g/pdfium/c/HwqzzGWWXVU/m/HdM-Lqb5AQAJ

However, we have never received any feedback from pdfium about surrogate pairs.

Would it be possible to add a unit test for this? And it's missing the WebAssembly implementation, but I can also make that.

jerbob92 avatar Feb 27 '25 08:02 jerbob92

I found this solution by analyzing the pdfium source code. I think it's better to use FPDFText_GetUnicode than FPDFText_GetText, whether or not pdfium handles surrogate pairs.

// FPDFText_GetText core code:
ByteString str = textpage->GetPageText(start_index, char_count).ToUCS2LE();  // convert the substring to UCS-2 LE bytes
auto str_span = fxcrt::reinterpret_span<const unsigned short>(str.span());
fxcrt::spancpy(result_span, str_span);  // copy into the caller's buffer

// FPDFText_GetUnicode core code:
const CPDF_TextPage::CharInfo& charinfo = textpage->GetCharInfo(index);  // look up by character index
return charinfo.m_Unicode;  // just return the uint32 value

I use it on Windows and Linux; I haven't used WebAssembly :p

xuges avatar Feb 28 '25 02:02 xuges

Yeah, I can take care of the WebAssembly implementation, but it would be nice to have a sample PDF that fails with the current implementation, so that we can confirm that the fix works and that it won't break again in the future.

jerbob92 avatar Mar 03 '25 14:03 jerbob92

Sorry, I'm too busy.

I don't have time to write a test case. This is the test code:

package main

import (
	"log"
	"os"
	"time"

	"github.com/klippa-app/go-pdfium/requests"
	"github.com/klippa-app/go-pdfium/single_threaded"
)

// check aborts the run on the first error instead of silently ignoring it.
func check(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	// Write one line per extracted character to output.log.
	logFile, err := os.Create("output.log")
	check(err)
	defer logFile.Close()
	logger := log.New(logFile, "", 0)

	pool := single_threaded.Init(single_threaded.Config{})
	defer pool.Close()

	instance, err := pool.GetInstance(2 * time.Second)
	check(err)
	defer instance.Close()

	pdfData, err := os.ReadFile("rect-wrong.pdf")
	check(err)

	doc, err := instance.OpenDocument(&requests.OpenDocument{
		File: &pdfData,
	})
	check(err)
	defer instance.FPDF_CloseDocument(&requests.FPDF_CloseDocument{
		Document: doc.Document,
	})

	pagesNum, err := instance.FPDF_GetPageCount(&requests.FPDF_GetPageCount{
		Document: doc.Document,
	})
	check(err)

	for pageIndex := 0; pageIndex < pagesNum.PageCount; pageIndex++ {
		// Extract every character of the page together with its position.
		textRes, err := instance.GetPageTextStructured(&requests.GetPageTextStructured{
			Page: requests.Page{
				ByIndex: &requests.PageByIndex{
					Document: doc.Document,
					Index:    pageIndex,
				},
			},
			Mode: requests.GetPageTextStructuredModeChars,
		})
		check(err)

		for charIndex, char := range textRes.Chars {
			logger.Printf("pageIndex=%d charIndex=%d text=%q left=%f top=%f right=%f bottom=%f\n",
				pageIndex, charIndex, char.Text, char.PointPosition.Left, char.PointPosition.Top, char.PointPosition.Right, char.PointPosition.Bottom)
		}
	}
}

Test result before the fix: [screenshot: 微信截图_20250716141028]

Test result after the fix: [screenshot: 2]

Problem document: rect-wrong.pdf

xuges avatar Jul 16 '25 06:07 xuges

Fixing this in #242

jerbob92 avatar Jul 23 '25 11:07 jerbob92

Fixed in v1.15.0!

jerbob92 avatar Jul 23 '25 12:07 jerbob92