wikiextractor
Option to keep == Section == syntax around titles
It would be nice to have this option. Knowing that a particular bit of text is a section title, and what level the section is, is useful for some downstream analysis tasks.
Currently:
Foo bar.
Foo bar is blah blah blah....
Desired:
= Foo bar =
Foo bar is blah blah blah....
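To make the request concrete, here is a minimal sketch of the desired behaviour. It is not wikiextractor code: `parse_heading` and `format_heading` are hypothetical helpers, and the sketch assumes a heading is a line wrapped in the same number of `=` characters on both sides.

```python
import re

# A wikitext heading: the same run of '=' (1 to 6) on both sides of the title.
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def parse_heading(line):
    """Return (level, title) if the line is a wikitext heading, else None."""
    m = HEADING_RE.match(line)
    if m is None:
        return None
    return len(m.group(1)), m.group(2)


def format_heading(level, title):
    """Re-emit a heading with its '=' markers so the level survives extraction."""
    marks = "=" * level
    return f"{marks} {title} {marks}"
```

With something like this, an extractor could emit `format_heading(*parsed)` for heading lines instead of the bare title, so downstream tools can still tell titles from body text and read off the section level.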
In v3.0.6, this can be solved by changing the default argument mark_headers=False to mark_headers=True in extract.Extractor.clean_text. Headings then start with #, e.g. "## Section 1".
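Once headings are marked this way, downstream code can split the extracted text into sections. The following is only a sketch: `split_sections` is a hypothetical helper, and it assumes every heading line starts with `"## "` (as in the example above) and that no body line does.

```python
def split_sections(text, marker="## "):
    """Group extracted text into (title, body_lines) pairs.

    Assumes heading lines start with `marker`; any text before the
    first heading is stored under the title None.
    """
    sections = [(None, [])]
    for line in text.splitlines():
        if line.startswith(marker):
            # Start a new section named after the marked heading.
            sections.append((line[len(marker):].strip(), []))
        elif line.strip():
            # Non-empty body line: attach it to the current section.
            sections[-1][1].append(line)
    return sections
```

For example, `split_sections("Intro\n## Section 1\nBody")` yields one untitled section holding `"Intro"` and one section titled `"Section 1"` holding `"Body"`.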