textract

epub parser: separate text blocks of logical elements by "Form Feed"

Open workflowsguy opened this issue 4 years ago • 0 comments

When the text read from the individual book elements of an epub file is combined, those elements are currently separated only by a '\n' character.

I suggest separating them by a '\f' character instead. This would be analogous to the current text extraction from PDF files, where the logical elements (the individual pages) are also separated by a Form Feed.

This would help to maintain at least some kind of structure of the original file in the resulting txt file and thus make parsing the logical structure possible.
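
To illustrate the idea, here is a minimal sketch, not textract's actual parser code. The helper names `join_elements` and `split_elements` and the sample strings are made up for this example. It assumes the text of each book element has already been extracted into a list of strings.

```python
FORM_FEED = "\f"  # ASCII 0x0C, the proposed separator


def join_elements(element_texts, separator=FORM_FEED):
    """Combine the text of each book element into a single string.

    Hypothetical helper: each element is stripped of surrounding
    whitespace and the elements are joined with the separator.
    """
    return separator.join(text.strip() for text in element_texts)


def split_elements(extracted_text, separator=FORM_FEED):
    """Recover the individual elements from the combined text."""
    return extracted_text.split(separator)


# Made-up sample data standing in for two elements of an epub file.
chapters = [
    "Chapter 1\nIt was a dark night.",
    "Chapter 2\nThe sun rose.",
]

combined = join_elements(chapters)
recovered = split_elements(combined)
```

With '\n' as the separator, the element boundaries could not be told apart from the line breaks inside each element. With '\f', a consumer of the txt file can split on the Form Feed and get the elements back.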

workflowsguy, Mar 01 '20 17:03