mwparserfromhell

Infobox not included in template

Open kireet opened this issue 6 years ago • 2 comments

I am looking for a simple way to extract the first paragraph of the first section from Wikipedia pages. I tried to get the first section and process the text/link nodes of that section, but it doesn't seem to work reliably. For example (using the parse method from the readme):

p = parse('Arthur Jensen')
for n in p.get_sections()[0].filter_text()[:6]:
    print(n)

prints

about
the Danish actor
Arthur Jensen (actor)
the New Zealand musician and composer
Arthur Owen Jensen
{{Infobox scientist
|name=Arthur Jensen
   |birth_name              = Arthur Robert Jensen
   |image             = Arthur Jensen Vanderbilt 2002.jpg
   |image_size        = 200px
   |caption           = Arthur Jensen, 2002 at 

The infobox template also doesn't seem to be returned by filter_templates. print([t.name for t in p.filter_templates()]) prints:

['about', 'Birth date', 'Death date and age', 'cite journal', 'cite web ', 'Webarchive', 'Says who', 'cite book ', 'cite book', 'cite book ', 'cite web', 'cite book ', 'cite journal ', 'cite book', 'cite book', 'quote', 'Cite journal', 'quote', 'cite web ', 'quote', 'cite journal ', 'cite web ', 'cite book ', 'cite book ', 'cite book ', 'cite web ', 'cite book ', 'cite journal', 'Cite news', 'quote', 'cite journal', 'cite journal', 'cite journal', 'cite web ', 'citation needed', 'webarchive ', 'quote', 'quote', 'quote', 'quote', 'cite book ', 'See also', 'Cite journal', 'cite journal ', 'cite journal ', 'cite journal ', 'Cite book', 'cite journal ', 'cite journal ', 'cite journal ', 'cite web ', 'Reflist', 'ISBN', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'Google Scholar id', 'Authority control', 'DEFAULTSORT:Jensen, Arthur']

kireet, Apr 10 '19 21:04

Sorry this took a while to get a response.

The template is missing in this case because there's a syntax inconsistency with a bold tag (see #40). You can work around this with parse(text, skip_style_tags=True).

To solve your original question, you can try something like this:

>>> code = parse(text, skip_style_tags=True)
>>> print(code.strip_code().splitlines()[0])
'''Arthur Robert Jensen''' (August 24, 1923 – October 22, 2012) was an American psychologist and author. He was a professor of educational psychology at the University of California, Berkeley.  Jensen was known for his work in psychometrics and differential psychology, the study of how and why individuals differ behaviorally from one another.

That gives the first paragraph as a string with templates and link markup removed. The bold/italic quote marks are still there, because style tags were skipped during parsing. If you want pure text without those either, you can reparse that string with the default skip_style_tags=False and call strip_code again.

If you want the actual nodes from the first paragraph, you could combine get_sections()[0] with a second step to remove any templates before the first non-whitespace text node.

earwig, Jun 30 '19 04:06