sumatrapdf
IFilter related small "features"
I am perusing the current Sumatra codebase (v3.60), trying to extend the PDF IFilter DLL support to other ebooks. More on that another time.
I have discovered the following small issues that you may consider addressing in a future version:
- Comparing text extraction in `FzTextPageToStr` with MuPDF's `do_as_text`: you only consider `FZ_STEXT_BLOCK_TEXT`, whereas MuPDF also converts `FZ_STEXT_BLOCK_STRUCT` blocks. I am not sure how often these blocks appear, perhaps only when the `FZ_STEXT_COLLECT_STRUCTURE` flag is used?
- The quality of text extraction will improve if you pass the `FZ_STEXT_INHIBIT_SPACES | FZ_STEXT_DEHYPHENATE` options to `fz_new_stext_page_from_page`, which discards imaginary spaces and hyphens at the end of a line.
- As text is extracted in page chunks, it makes sense to mark the breakType as `CHUNK_EOP` instead of `CHUNK_NO_BREAK`.
- There are some memory leaks, as `DestroyLogging` and `DestroyTempAllocator` aren't called during `DLL_PROCESS_DETACH`. Sumatra uses a lot of globals, so I am not sure if more needs to be cleaned up(?)
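To show why the dehyphenation point matters for indexing, here is a rough, self-contained sketch of the effect. It is an illustration only, not MuPDF's implementation, and `JoinLinesDehyphenated` is a hypothetical name that exists in neither codebase. It assumes a line ending in `-` is a word split across lines, which will also glue genuinely hyphenated compounds.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustration only: joins extracted lines roughly the way a dehyphenating
// extractor would. Hypothetical helper, not part of MuPDF or SumatraPDF.
std::string JoinLinesDehyphenated(const std::vector<std::string>& lines) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); i++) {
        const std::string& line = lines[i];
        const bool isLast = (i + 1 == lines.size());
        if (!isLast && !line.empty() && line.back() == '-') {
            // Trailing hyphen: drop it and glue the next line on directly,
            // so the split word is indexed as one token.
            out.append(line, 0, line.size() - 1);
        } else {
            out += line;
            // Ordinary line break: separate the lines with a single space.
            if (!isLast) out += ' ';
        }
    }
    return out;
}
```

Without this step, a search for "extraction" would miss a page where the word was broken as "extrac-" / "tion".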
Finally, if you add `AddLineSep("\r\n")` during text extraction, you don't have to replace `\n` post-facto within `GetNextChunkValue`.
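A minimal sketch of the difference, using stand-in helpers rather than the real Sumatra code (`JoinLines` and `LfToCrLf` are hypothetical names, and I am not reproducing the actual `AddLineSep` signature here): emitting the wanted separator while joining gives the same result as a second pass over the text, without the extra copy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: writes the desired separator while joining lines,
// which is the "do it during extraction" approach.
std::string JoinLines(const std::vector<std::string>& lines, const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); i++) {
        if (i > 0) out += sep;
        out += lines[i];
    }
    return out;
}

// Hypothetical helper: the post-facto alternative, a second pass that turns
// every lone "\n" into "\r\n" and leaves existing "\r\n" pairs untouched.
std::string LfToCrLf(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); i++) {
        if (s[i] == '\n' && (i == 0 || s[i - 1] != '\r')) out += '\r';
        out += s[i];
    }
    return out;
}
```

Both routes produce identical text, so the one-pass version is purely a saving in work and allocations.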
hth nikos