2017
Pride, Prejudice by @hugovk
Generated output
- output.txt: 103,038 words
- output2.txt: 51,142 words
What it does
The problem isn't generating over 50,000 words. The problem is existing books are too long. Pride and Prejudice is 130,000 words, Moby Dick is 215,136 words (or 215,136 meows). And we all know 50,000 is the gold standard for a novel! So how can we reduce the word count?
- Remove Project Gutenberg boilerplate
- Use contractions everywhere: "won't" instead of "will not", "t'" instead of "the"
- Replace "and" with a comma, "or" with a slash
- Delete parenthetical "however", "indeed" and "I dare say"
- Remove honorifics (Mr., Mrs., Miss, Dr.)
- 'Substitute ‘damn’ every time you’re inclined to write ‘very’; your editor will delete it and the writing will be just as it should be.'
- Replace redundant phrases like "whether or not" with just "whether"
These tactics reduce Pride and Prejudice by about 15% to 111,000 words.
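A few of the tactics above can be sketched with regular expressions. This is only a minimal illustration, not the actual reducifier.py: the rule tables and the names `reduce_text` and `word_count` are hypothetical, and the real script may use different lists and ordering.

```python
import re

# Hypothetical rule tables, made up for this sketch.
REDUNDANT = {r"\bwhether or not\b": "whether"}
HONORIFICS = re.compile(r"\b(?:Mr|Mrs|Miss|Dr)\b\.?\s+")
CONTRACTIONS = {
    r"\bwill not\b": "won't",
    r"\bdo not\b": "don't",
    r"\bthe\s+": "t'",  # glue "the" onto the next word
}


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def reduce_text(text):
    """Apply a handful of word-saving substitutions in turn."""
    # Redundant phrases go first, before "or" is turned into a slash.
    for pattern, replacement in REDUNDANT.items():
        text = re.sub(pattern, replacement, text)
    text = HONORIFICS.sub("", text)
    for pattern, replacement in CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\s+and\s+", ", ", text)  # "and" becomes a comma
    text = re.sub(r"\s+or\s+", "/", text)    # "or" becomes a slash
    return text


sample = "Mr. Bennet will not say whether or not the house and the garden are let."
print(word_count(sample), "->", word_count(reduce_text(sample)))
print(reduce_text(sample))
```

Order matters in a sketch like this: collapsing "whether or not" has to happen before the "or" rule runs, otherwise the phrase is already broken up by the time it is looked for.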
Next we work out the ratio of our word count to 50k, count how many sentences we have, work out how many of them to keep to approach 50k, and use a text summariser to chop out the dead wood.
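The sentence budget can be sketched as below. The rounding is an assumption: rounding the ratio up and then integer-dividing the sentence count happens to reproduce the figures in the example run (ratio 3, 4588 sentences, 1529 kept), but the real script may compute it differently. The function name `plan_summary` is hypothetical.

```python
import math

TARGET_WORDS = 50_000


def plan_summary(word_count, sentence_count, target=TARGET_WORDS):
    """Return (ratio, sentences_to_keep) for a given text size.

    Assumption: the ratio is rounded up to a whole number, and the
    number of sentences to keep is the sentence count divided by it.
    """
    ratio = max(1, math.ceil(word_count / target))
    keep = sentence_count // ratio
    return ratio, keep


ratio, keep = plan_summary(word_count=111_633, sentence_count=4588)
print("Ratio (words/50k):", ratio)
print("Number to keep:", keep)
```

The summariser is then asked for that many sentences. Since the kept sentences are not all of average length, the final word count only approaches 50k rather than hitting it exactly.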
How to do it
Run:
pip install -r requirements.txt
python reducifier.py
Example:
python reducifier.py
open
word count: 130,000
word count: 126,936 diff: 97.643% deboilerplatify
word count: 125,438 diff: 96.491% remove_quote_things
word count: 121,549 diff: 93.499% deveryify
word count: 121,018 diff: 93.091% decontractify
word count: 111,633 diff: 85.872% dehonorify
Ratio (words/50k): 3
Number of sentences: 4588
Number to keep: 1529
word count: 54,273 diff: 41.748% summarise
This produces output.txt before the summariser, and output2.txt after the summariser.
Works at least on macOS High Sierra with Python 3.6.3.
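The "diff" column in the example log appears to be the current word count as a percentage of the original 130,000 words. A minimal sketch of such a progress line follows, assuming that reading is right; the helper name `report` is hypothetical.

```python
def report(original_words, current_words, step):
    """Format one progress line in the style of the example log."""
    percent = 100 * current_words / original_words
    return f"word count: {current_words:,} diff: {percent:.3f}% {step}"


print(report(130_000, 126_936, "deboilerplatify"))
print(report(130_000, 54_273, "summarise"))
```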
Example
Here's a diff of Pride and Prejudice and the first pass output.txt:
[Image: diff of Pride and Prejudice against output.txt]
Source code
https://github.com/hugovk/NaNoGenMo-2017/tree/master/03-reducifier
Ha, this is great! 60% reduced Pride and Prejudice is still totally readable.
Too bad the summarizer took out all the damns.
"Remove honorifics (Mr., Mrs., Miss, Dr.)" 😱 How can I then tell the "Bennet"s apart?!
@janelleshane Cliff-notes are also readable.
Great
@henrikh @danesparza Yep, I did realise that but unfortunately they just had to go to reduce the word count :) I should have replaced "Mrs. Bennet" with her maiden name, "Gardiner"!
Sometimes you will see major characters referred to with a shortened version of the name after introduction. I would suggest calling Mrs. Bennet Mrs. B, Mr. Bennet Mr. B. That doesn't remove the honorifics or reduce the word count, but it does reduce the character count.
Actually considering the patriarchy Mr. B can just be B.
on edit: Ms can be used in place of Mrs. in modern times of course.
@bryanrasmussen Word count is all that matters :)
Not if your last name is Hugo, and your first Victor!
On Tue, Dec 5, 2017 at 9:22 AM, Hugo wrote:
@bryanrasmussen Word count is all that matters :)
PS. Using the 't' contraction instead of 'the' makes this really hard to parse.
Only in some cases.
"...by a young man of large fortune from t'north of England;"[1]
This is just about the perfect edit.
[1] https://github.com/hugovk/NaNoGenMo-2017/blob/master/03-reducifier/output.txt#L35
:)
See https://news.ycombinator.com/item?id=15823499 for more discussion.
@henrikh you'd have to make do with context, I suppose, but that's not all that different than the base text because only the eldest daughter is addressed by only her surname ("Miss Bennet") whereas the younger daughters are addressed with either their first or full names ("Miss Elizabeth" / "Miss Elizabeth Bennet"). I haven't read Pride and Prejudice in a while, are there any examples where the reader must discern identity (among Bennets or any other family) from context?
@philsnow As far as I recall, Elizabeth is actually referred to as "Miss Bennet" when addressed directly by Mr Darcy and Mr Wickham -- but, of course, in those situations there would be no doubt 😉