chapterize

A simple tool for splitting up an ebook into its chapters. Works well with Project Gutenberg texts. May also be used to clean up books for computational text analysis.

7 chapterize issues

In chapterize.py, line 173: `endLocation = len(self.lines)-1 # The end`. I think it would be better set to `len(self.lines)`, because if we can't detect the end location, the last line could possibly...
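A minimal sketch of why the off-by-one matters, assuming `endLocation` is used as the exclusive upper bound of a slice (the excerpt does not confirm how chapterize.py uses it). The sample `lines` list and variable names are illustrative only.

```python
# Hypothetical stand-in for self.lines; not taken from chapterize.py.
lines = ["Chapter 1", "Some text.", "Chapter 2", "The final line."]

start = 2  # index of the last heading

# Assumption: the end location is used as an exclusive slice bound.
end_minus_one = len(lines) - 1
end_full = len(lines)

# With len(lines) - 1 as the bound, the slice stops before the last line.
truncated = lines[start:end_minus_one]

# With len(lines) as the bound, the slice reaches the last line.
complete = lines[start:end_full]

print(truncated)
print(complete)
```

If the code instead treats the end location as an inclusive index, `len(self.lines)-1` would already be correct, so the fix depends on how the value is consumed.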

Hi. I tried using chapterize with a text that has chapter titles in this format: '4. The Black Bird' (number, then title), and chapterize returns the headings <...
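One possible way to recognize headings in the "number, then title" format is a regular expression anchored to the whole line. This is only a sketch: the pattern and the function name `find_numbered_headings` are assumptions, not part of chapterize.

```python
import re

# Matches lines such as "4. The Black Bird": digits, a period, then a title.
NUMBERED_HEADING = re.compile(r"^\s*(\d+)\.\s+(\S.*?)\s*$")


def find_numbered_headings(lines):
    """Return (line_index, chapter_number, title) for each matching line."""
    found = []
    for index, line in enumerate(lines):
        match = NUMBERED_HEADING.match(line)
        if match:
            found.append((index, int(match.group(1)), match.group(2)))
    return found


sample = ["4. The Black Bird", "", "Some text here.", "5. Another Title"]
headings = find_numbered_headings(sample)
print(headings)
```

A pattern this loose would also match numbered list items inside the body text, so a real implementation would probably need extra checks, such as requiring surrounding blank lines or consecutive chapter numbers.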

It'd be nice to have automatic tests and hook in Travis or some other CI. But it might be better just to deprecate this version of the tool, anyway, and...

I wrote a quick-and-dirty HTML chapterizer that could be integrated with this one: https://github.com/JonathanReeve/chapter-experiments/blob/master/chapterize-html.ipynb

It would be nice if this could parse short stories, like this: http://www.gutenberg.org/cache/epub/25519/pg25519.txt Possibly detecting a 'contents' section and getting the titles from there would work, at least for that...

Just write to log: text name, number of chapters, lengths of each chapter. This will enable studies of lots of texts at a time.
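The logging idea could be sketched with the standard `logging` module as follows. The logger name, the function names, and the choice to measure chapter length in whitespace-separated words are all assumptions; the issue does not say which unit is intended.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("chapterize.stats")


def summarize_chapters(text_name, chapters):
    """Build a record with the text name, chapter count, and chapter lengths.

    Assumption: length is counted in whitespace-separated words.
    """
    lengths = [len(chapter.split()) for chapter in chapters]
    return {"text": text_name, "num_chapters": len(chapters), "lengths": lengths}


def log_chapter_stats(text_name, chapters):
    """Write one log line per text and return the summary record."""
    stats = summarize_chapters(text_name, chapters)
    logger.info(
        "%s: %d chapters, lengths in words: %s",
        stats["text"], stats["num_chapters"], stats["lengths"],
    )
    return stats


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_chapter_stats("demo.txt", ["one two three", "four five"])
```

Writing one line per text keeps the log easy to parse later when many texts are processed in a batch.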

NLTK's word_tokenize function requires the Punkt data to be downloaded, which could effectively break the program for users who don't know what's going on.
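One way to keep the program usable when the tokenizer data is missing is to catch the `LookupError` that NLTK raises for an absent resource and fall back to a simpler tokenizer. The sketch below is an assumption about how this could be handled, not how chapterize handles it; `tokenize_words` and the regex fallback are hypothetical, and the fallback does not reproduce `word_tokenize` output exactly.

```python
import re


def tokenize_words(text):
    """Tokenize with NLTK when available, otherwise use a rough regex fallback."""
    try:
        from nltk.tokenize import word_tokenize
        # Raises LookupError if the tokenizer data has not been downloaded.
        return word_tokenize(text)
    except (ImportError, LookupError):
        # Fallback: runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize_words("Hello world")
print(tokens)
```

An alternative design is to catch the same error and print a short message telling the user to run NLTK's downloader for the missing resource, which avoids silently changing tokenization behavior.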