Problem with sections from RfA pages

Open ananth1996 opened this issue 5 years ago • 3 comments

I'm trying to parse the sections of RfA pages such as https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship/7. get_sections() always seems to return a single section, even when I parse with skip_style_tags=True, whereas filter_headings() returns all the headings. Is there a fix for this? I want to parse the Support, Oppose and Neutral votes. Is there a better way to do this in Python?

ananth1996 avatar Jun 07 '19 09:06 ananth1996

Hi @ananth1996,

The issue is basically that the entire RfA content is inside a <div> tag, and get_sections() expects headings to be nodes at the top level of the wikicode. Since all headings are inside that <div>, it considers the entire page to be one section.

Here's a cheap workaround:

>>> import mwparserfromhell
>>> code = mwparserfromhell.parse(text, skip_style_tags=True)
>>> if code:
...     first = code.get(0)
...     if isinstance(first, mwparserfromhell.nodes.Tag) and first.tag == 'div':
...         code = first.contents
...
>>> len(code.get_sections())
9

I'll think more about a way to fix this inside the parser.

earwig avatar Jun 09 '19 19:06 earwig

Thank you for the workaround; it works properly. I also wanted to ask whether there is any particular way to iterate through list items, like some of the methods in wikitextparser. I am also looking to extract the user signature at the end of every vote and was wondering whether a template or general regex pattern is already available in some parser. Thanks in advance.

ananth1996 avatar Jun 10 '19 08:06 ananth1996

I don’t think there’s a good built-in way to do that, unfortunately. You would need to do some manual node iteration. For example: for each unnested li tag, find the last wikilink to a user page or user talk page before the next li tag. Something like that might work.

earwig avatar Jun 10 '19 11:06 earwig