chapterize
A simple tool for splitting up an ebook into its chapters. Works well with Project Gutenberg texts. May also be used to clean up books for computational text analysis.
In chapterize.py, line 173, `endLocation = len(self.lines)-1 # The end`: I think it would be better to set this to `len(self.lines)`, because if we can't detect the end location, the last line could possibly...
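To illustrate the concern, here is a minimal sketch of how an end index of `len(lines) - 1` behaves when it is used as the upper bound of a Python slice. It assumes `endLocation` is used as an exclusive slice bound, which this page does not confirm; the sample `lines` list is made up for illustration and is not taken from chapterize.py.

```python
# Hypothetical sample text; not the real contents of self.lines.
lines = ["Chapter 1", "text a", "Chapter 2", "text b", "last line"]

start = 0

# Fallback end index as currently written in the issue.
end_minus_one = len(lines) - 1
# Fallback end index as proposed in the issue.
end_full = len(lines)

# Python slices exclude the upper bound, so the first slice
# stops one line short of the end of the text.
body_short = lines[start:end_minus_one]
body_full = lines[start:end_full]

print(body_short[-1])  # last line kept with len(lines) - 1
print(body_full[-1])   # last line kept with len(lines)
```

If `endLocation` is instead used as an inclusive index somewhere in the code, `len(self.lines)-1` would be the right value, so the fix depends on how the index is consumed.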
Hi. I tried using chapterize with a text whose chapter titles are in this format: '4. The Black Bird' (number, then title), and chapterize returns the headings <...
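One possible way to recognize headings in the "number, then title" format is a dedicated regular expression. The sketch below is not chapterize's own heading detection; `NUMBERED_HEADING` and `find_numbered_headings` are hypothetical names, and the pattern assumes the title starts with a capital letter and sits alone on its line.

```python
import re

# Assumed pattern: optional indent, digits, a period, whitespace,
# then a title that starts with a capital letter.
NUMBERED_HEADING = re.compile(r"^\s*(\d+)\.\s+([A-Z].*\S)\s*$")


def find_numbered_headings(lines):
    """Return (line_index, chapter_number, title) for each matching line."""
    headings = []
    for index, line in enumerate(lines):
        match = NUMBERED_HEADING.match(line)
        if match:
            headings.append((index, int(match.group(1)), match.group(2)))
    return headings


# Hypothetical sample input.
sample = [
    "3. An Earlier Chapter",
    "",
    "Some prose here.",
    "4. The Black Bird",
    "He said 4.5 was enough.",
]
headings = find_numbered_headings(sample)
```

A pattern this loose would also match numbered list items inside the prose, so a real implementation would probably need an extra check, such as requiring blank lines around the heading or consecutive chapter numbers.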
It'd be nice to have automatic tests and hook in Travis or some other CI. But it might be better just to deprecate this version of the tool, anyway, and...
I wrote a quick-and-dirty HTML chapterizer that could be integrated with this one: https://github.com/JonathanReeve/chapter-experiments/blob/master/chapterize-html.ipynb
It would be nice if this could parse short stories, like this one: http://www.gutenberg.org/cache/epub/25519/pg25519.txt. Possibly detecting a 'contents' section and getting the titles from there would work, at least for that...
Just write to log: text name, number of chapters, lengths of each chapter. This will enable studies of lots of texts at a time.
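A minimal sketch of what such logging could look like, using Python's standard `logging` module. The function names, the logger name `chapterize.stats`, and the choice to measure chapter length in whitespace-separated words are all assumptions; the sketch also assumes each chapter is available as a list of line strings.

```python
import logging

# Hypothetical logger name; not taken from chapterize itself.
logger = logging.getLogger("chapterize.stats")


def chapter_stats(text_name, chapters):
    """Summarize a text: its name, chapter count, and per-chapter word counts.

    `chapters` is assumed to be a list of chapters, each a list of lines.
    """
    lengths = [len(" ".join(chapter).split()) for chapter in chapters]
    return {"text": text_name, "chapters": len(chapters), "lengths": lengths}


def log_chapter_stats(text_name, chapters):
    """Write one log line per text and return the stats dictionary."""
    stats = chapter_stats(text_name, chapters)
    logger.info(
        "text=%s chapters=%d lengths=%s",
        stats["text"], stats["chapters"], stats["lengths"],
    )
    return stats
```

Calling `logging.basicConfig(filename="chapterize.log", level=logging.INFO)` before processing a batch would collect one line per text in a single file, which could then be parsed for corpus-wide studies.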
NLTK's word_tokenize function requires the Punkt data to be downloaded, which could effectively break the program for those who don't know what's going on.
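One way to soften this failure is to check for the data at startup and download it if it is missing. The sketch below is a suggestion, not something chapterize currently does. `ensure_resource` and `ensure_punkt` are hypothetical helper names; `nltk.data.find` and `nltk.download` are real NLTK functions, and `nltk.data.find` raises `LookupError` when a resource is absent. Newer NLTK releases may look for a `punkt_tab` resource instead of `punkt`, so the resource names here are assumptions to verify against the installed version.

```python
def ensure_resource(resource_path, package, find, download):
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True if a download was triggered, False otherwise.
    """
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True


def ensure_punkt():
    """Make sure the Punkt tokenizer data is available before tokenizing."""
    # Imported lazily so the helper above stays usable without NLTK.
    import nltk

    # Resource names are assumptions; newer NLTK may need "punkt_tab".
    return ensure_resource(
        "tokenizers/punkt", "punkt", nltk.data.find, nltk.download
    )
```

Calling `ensure_punkt()` once before the first `word_tokenize` call would replace the opaque lookup error with an automatic download, at the cost of requiring network access on first run.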