WikiPlots
'plot' in current.get_text().lower() doesn't match all relevant headers
In a similar project of mine, I used this regexp:
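import re

# Section titles that commonly introduce a plot description.
# 'Contents?' matches both "Content" and "Contents".
PLOT = [
    'Plot summary', 'Plot', 'Plot introduction',
    'Synopsis', 'Summary', 'Plot synopsis',
    'Overview', 'Story', 'Description', 'Contents?',
]
# Matches a wikitext heading line such as "== Plot summary ==".
HEADING_RE = re.compile(
    r'^ *=+\s*(%s)\s*=+' % '|'.join(PLOT),
    re.IGNORECASE | re.UNICODE | re.MULTILINE)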
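To illustrate the difference from the substring test, here is a minimal sketch that runs the same pattern over a made-up piece of wikitext. The `SAMPLE` text and the `plot_headings` helper are assumptions for illustration only and are not part of either project.

```python
import re

PLOT = [
    'Plot summary', 'Plot', 'Plot introduction',
    'Synopsis', 'Summary', 'Plot synopsis',
    'Overview', 'Story', 'Description', 'Contents?',
]
HEADING_RE = re.compile(
    r'^ *=+\s*(%s)\s*=+' % '|'.join(PLOT),
    re.IGNORECASE | re.UNICODE | re.MULTILINE)

# Hypothetical wikitext, invented for this example.
SAMPLE = """\
== Synopsis ==
A detective returns home.

== Reception ==
Critics liked the plot twists.

=== Plot summary ===
More text.
"""

def plot_headings(wikitext):
    """Return the heading titles in wikitext that HEADING_RE accepts."""
    return [m.group(1) for m in HEADING_RE.finditer(wikitext)]

print(plot_headings(SAMPLE))  # ['Synopsis', 'Plot summary']

# The substring test from the issue title misses the first heading:
print('plot' in '== Synopsis =='.lower())  # False
```

Note that the pattern works on raw wikitext headings (`== ... ==`), so a script that reads rendered HTML headers would need to compare against the title list directly instead.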
Thanks for the suggestion. This might pick up some things that aren't novels, movies, or video games, though. I'll try it out and see.
Right. That's why I used Wikipedia categories, which may be too big a change for your script. FWIW, here's the breakdown of headers in articles about novels:

Plot summary       8466
Plot               5664
Plot introduction  1696
Synopsis           1492
Summary             636
Plot Summary        314
Plot synopsis       213
Overview            212
Story               124
Description          97
Contents             67
Content              53
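For anyone who wants to produce a similar tally on their own dump, a rough sketch follows. It is not how the figures above were computed: `ANY_HEADING_RE`, `count_headings`, and the `ARTICLES` sample are hypothetical, and the sketch does no category filtering, so restricting it to novels would still have to happen upstream.

```python
import re
from collections import Counter

# Matches any wikitext heading line and captures its title (assumed format).
ANY_HEADING_RE = re.compile(r'^ *=+\s*(.*?)\s*=+ *$', re.MULTILINE)

def count_headings(articles):
    """Count heading titles, case-sensitively, across an iterable of wikitext strings."""
    counts = Counter()
    for text in articles:
        counts.update(m.group(1) for m in ANY_HEADING_RE.finditer(text))
    return counts

# Hypothetical article texts, invented for this example.
ARTICLES = [
    "== Plot summary ==\ntext\n== Reception ==\ntext\n",
    "== Plot ==\ntext\n== Reception ==\n",
    "== Plot summary ==\ntext\n",
]

counts = count_headings(ARTICLES)
for title, n in counts.most_common():
    print(title, n)
```

Sorting the result with `most_common()` makes it easy to see which titles are worth adding to the `PLOT` list.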