textract
epub parser: separate text blocks of logical elements by "Form Feed"
When combining the text read from the individual book elements of an epub file, those elements are currently separated only by a '\n' character.
I suggest separating them by a '\f' character instead. This would be analogous to the current text extraction from PDF files, where the logical elements (the individual pages) are also separated by a Form Feed.
This would preserve at least some of the original file's structure in the resulting txt file, and thus make it possible to parse the logical structure afterwards.
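
To illustrate the idea, here is a minimal sketch. The helper names `join_sections` and `split_sections` are made up for this example and are not taken from textract's code; the sample strings stand in for the text read from two book elements.

```python
FORM_FEED = "\f"


def join_sections(sections, separator=FORM_FEED):
    """Combine the text of each book element, placing the separator between them."""
    return separator.join(sections)


def split_sections(text, separator=FORM_FEED):
    """Recover the per-element text blocks from the combined output."""
    return text.split(separator)


# Hypothetical text of two book elements; each contains its own newlines.
sections = ["Chapter 1\nFirst paragraph.", "Chapter 2\nSecond paragraph."]

combined = join_sections(sections)
recovered = split_sections(combined)
```

With '\f' as the separator, splitting the output gives back the original elements. With '\n', the element boundaries cannot be told apart from ordinary line breaks inside an element.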