sumatrapdf IFilter related small "features"

Open umeca74 opened this issue 8 months ago • 7 comments

I am perusing the current Sumatra codebase (v3.60), trying to extend the PDF IFilter DLL support to other ebook formats. More on that another time.

I have discovered the following small issues that you may consider addressing in a future version:

Comparing text extraction in FzTextPageToStr with MuPDF's do_as_text, you only handle FZ_STEXT_BLOCK_TEXT, whereas MuPDF also converts FZ_STEXT_BLOCK_STRUCT blocks. I'm not sure how often these blocks appear; perhaps only when the FZ_STEXT_COLLECT_STRUCTURE flag is used?
The quality of text extraction should improve if you pass the FZ_STEXT_INHIBIT_SPACES | FZ_STEXT_DEHYPHENATE options to fz_new_stext_page_from_page, which gets rid of imaginary spaces and end-of-line hyphens.
As text is extracted in page-sized chunks, it makes sense to set breakType to CHUNK_EOP instead of CHUNK_NO_BREAK.
There are some memory leaks because DestroyLogging and DestroyTempAllocator aren't called during DLL_PROCESS_DETACH. Sumatra uses a lot of globals, so I'm not sure whether more needs to be cleaned up.
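
To illustrate the breakType point, here is a minimal standalone sketch, not Sumatra's actual code. BreakType is a local stand-in that only borrows the member names of the Windows CHUNK_BREAKTYPE enumeration, and PageChunk and MakePageChunks are hypothetical names made up for the example. It assumes one chunk per page.

```cpp
#include <string>
#include <vector>

// Local stand-in for the IFilter CHUNK_BREAKTYPE enumeration.
// Only the member names are borrowed; this is not the Windows header.
enum class BreakType { CHUNK_NO_BREAK, CHUNK_EOW, CHUNK_EOS, CHUNK_EOP, CHUNK_EOC };

// Hypothetical record describing one chunk handed to the indexer.
struct PageChunk {
    int id;               // chunk ids start at 1 in this sketch
    BreakType breakType;  // break between the previous chunk and this one
    std::string text;     // extracted page text
};

// Emit one chunk per page, each marked as separated from the previous
// chunk by a paragraph break, so the last word of one page is not
// treated as running into the first word of the next.
std::vector<PageChunk> MakePageChunks(const std::vector<std::string>& pages) {
    std::vector<PageChunk> chunks;
    int id = 1;
    for (const auto& text : pages) {
        chunks.push_back({id++, BreakType::CHUNK_EOP, text});
    }
    return chunks;
}
```

In a real filter the value would go into the breakType field of the chunk descriptor returned for each page; the sketch only shows the choice of value.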

Finally, if you add an AddLineSep("\r\n") call during text extraction, you don't have to replace \n after the fact in GetNextChunkValue.
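
The separator idea can be shown with a small standalone sketch. JoinLines and ReplaceNewlines are hypothetical helpers written for this example and are not functions from the Sumatra or MuPDF code; the assumption is that extraction produces text one line at a time and can append whatever separator it is given.

```cpp
#include <string>
#include <vector>

// Hypothetical extractor: appends a caller-chosen separator after each line,
// so the output already has the desired line endings.
std::string JoinLines(const std::vector<std::string>& lines, const std::string& sep) {
    std::string out;
    for (const auto& line : lines) {
        out += line;
        out += sep;
    }
    return out;
}

// The after-the-fact approach: copy the text again, expanding each '\n'
// to "\r\n". Assumes the input contains bare '\n' only.
std::string ReplaceNewlines(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (char c : s) {
        if (c == '\n') {
            out += "\r\n";
        } else {
            out += c;
        }
    }
    return out;
}
```

Both routes give the same text, but emitting "\r\n" up front avoids the second pass and the extra copy.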

hth nikos

Mar 20 '25 13:03 umeca74