go-pdfium
Fix incorrect text extraction after special characters
The FPDFText_CountChars function returns the number of text characters on the page,
and the FPDFText_GetCharBox function returns the bounding box of a character on the page; its text_index argument is a character index.
The FPDFText_GetText function, however, returns a substring of the page text, and its text_index argument is a UTF-16 code unit index!
So if a character needs two (or more) UTF-16 code units, every character after it no longer matches its character box.
I ran into this case and found a solution: use the FPDFText_GetUnicode function instead of FPDFText_GetText.
The FPDFText_GetUnicode function returns a character's Unicode code point, and its text_index argument is a character index too.
Finally, I suspect that pdfium always assumes UTF-16 takes only 2 bytes per character :(
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 75.54%. Comparing base (3516e50) to head (bad8a27). Report is 39 commits behind head on main.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
+ Coverage   75.51%   75.54%   +0.02%
==========================================
  Files         110      110
  Lines       25318    25319       +1
==========================================
+ Hits        19119    19126       +7
+ Misses       4365     4361       -4
+ Partials     1834     1832       -2
Hi! We have noticed issues with this before, see some of the discussion in: https://groups.google.com/g/pdfium/c/HwqzzGWWXVU/m/HdM-Lqb5AQAJ
However, we have never received any feedback from pdfium about surrogate pairs.
Would it be possible to add a unit test for this? And it's missing the WebAssembly implementation, but I can also make that.
I found this solution by analyzing the pdfium source code. I think it's better to use FPDFText_GetUnicode than FPDFText_GetText, whether or not pdfium handles surrogate pairs.
// FPDFText_GetText core code:
ByteString str = textpage->GetPageText(start_index, char_count).ToUCS2LE(); // cast
auto str_span = fxcrt::reinterpret_span<const unsigned short>(str.span());
fxcrt::spancpy(result_span, str_span); // copy
// FPDFText_GetUnicode core code:
const CPDF_TextPage::CharInfo& charinfo = textpage->GetCharInfo(index); // reference
return charinfo.m_Unicode; // just return uint32
I use it on Windows and Linux; I haven't used WebAssembly :p
Yeah, I can take care of the WebAssembly implementation, but it would be nice to have a sample PDF that fails with the current implementation, so that we can confirm that the fix works and that it won't break again in the future.
Sorry, I'm too busy and don't have time to write a test case. This is the test code I used:
package main

import (
	"log"
	"os"
	"time"

	"github.com/klippa-app/go-pdfium/requests"
	"github.com/klippa-app/go-pdfium/single_threaded"
)

func main() {
	// Write one line per extracted character to output.log.
	logFile, err := os.Create("output.log")
	if err != nil {
		log.Fatal(err)
	}
	defer logFile.Close()
	logger := log.New(logFile, "", 0)

	pool := single_threaded.Init(single_threaded.Config{})
	defer pool.Close()
	instance, err := pool.GetInstance(2 * time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer instance.Close()

	pdfData, err := os.ReadFile("rect-wrong.pdf")
	if err != nil {
		log.Fatal(err)
	}
	doc, err := instance.OpenDocument(&requests.OpenDocument{File: &pdfData})
	if err != nil {
		log.Fatal(err)
	}
	defer instance.FPDF_CloseDocument(&requests.FPDF_CloseDocument{Document: doc.Document})

	pagesNum, err := instance.FPDF_GetPageCount(&requests.FPDF_GetPageCount{Document: doc.Document})
	if err != nil {
		log.Fatal(err)
	}
	for pageIndex := 0; pageIndex < pagesNum.PageCount; pageIndex++ {
		// Extract text per character so each character comes with its own box.
		textRes, err := instance.GetPageTextStructured(&requests.GetPageTextStructured{
			Page: requests.Page{
				ByIndex: &requests.PageByIndex{Document: doc.Document, Index: pageIndex},
			},
			Mode: requests.GetPageTextStructuredModeChars,
		})
		if err != nil {
			log.Fatal(err)
		}
		for charIndex, char := range textRes.Chars {
			logger.Printf("pageIndex=%d charIndex=%d text=%q left=%f top=%f right=%f bottom=%f\n",
				pageIndex, charIndex, char.Text, char.PointPosition.Left, char.PointPosition.Top,
				char.PointPosition.Right, char.PointPosition.Bottom)
		}
	}
}
Test result before the fix:
Test result after the fix:
Problem document: rect-wrong.pdf
Fixing this in #242
Fixed in v1.15.0!