mwparserfromhell
Problem with sections from RfA pages
I'm trying to parse the sections from RfA pages such as https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship/7. Calling get_sections() always seems to return a single section, even if I parse with skip_style_tags=True. Is there any fix for this? The filter_headings() function does return all the headings.
I want to parse the Support, Oppose and Neutral votes. Is there a better way to do this in Python?
Hi @ananth1996,
The issue is basically that the entire RfA content is inside a <div> tag, and get_sections() expects headings to be nodes at the top level of the wikicode. Since all headings are inside that <div>, it considers the entire page to be one section.
Here's a cheap workaround:
>>> import mwparserfromhell
>>> code = mwparserfromhell.parse(text, skip_style_tags=True)
>>> if code:
...     first = code.get(0)
...     # Unwrap the outer <div> so its headings become top-level nodes
...     if isinstance(first, mwparserfromhell.nodes.Tag) and first.tag == 'div':
...         code = first.contents
...
>>> len(code.get_sections())
9
I'll think more about a way to fix this inside the parser.
Thank you for the workaround, it is working properly.
I also wanted to ask if there is any particular way to iterate through list items, like some of the methods in wikitextparser. I am also looking to extract the user signature at the end of every vote and was wondering if there is a template or general regex pattern already available in some parser.
Thanks in advance.
I don’t think there’s a good built-in way to do that, unfortunately. You would need to do some manual node iteration. For example: for each unnested li tag, find the last wikilink to a user page or user talk page before the next li tag. Something like that might work.
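As a rough illustration of that heuristic, here is a minimal sketch that uses plain regular expressions on the raw wikitext of one section instead of mwparserfromhell node iteration. It assumes each vote is a single line starting with exactly one "#", treats lines starting with "#:", "#*" or "##" as nested replies, and takes the last [[User:...]] or [[User talk:...]] link on the vote line as the signature. The names USER_LINK, TOP_LEVEL_ITEM and extract_voters are made up for this example and are not part of any library.

```python
import re

# Assumption: signatures link to "User:Name" or "User talk:Name",
# optionally followed by a subpage, a fragment, or a |label.
USER_LINK = re.compile(
    r"\[\[\s*:?\s*User(?:[ _]talk)?\s*:\s*([^\]|/#]+)",
    re.IGNORECASE,
)

# A top-level vote starts with exactly one "#"; "##", "#:", "#*" and "#;"
# are nested items (replies) and are skipped.
TOP_LEVEL_ITEM = re.compile(r"^#(?![#:*;])")


def extract_voters(section_text):
    """Return one username per top-level '#' item (None if unsigned)."""
    voters = []
    for line in section_text.splitlines():
        if not TOP_LEVEL_ITEM.match(line):
            continue
        # The signature is normally the last user link on the vote line.
        links = USER_LINK.findall(line)
        voters.append(links[-1].strip() if links else None)
    return voters


sample = """\
# '''Support''' Good candidate. [[User:Alice|Alice]] ([[User talk:Alice|talk]]) 10:00, 1 June 2019 (UTC)
#: I disagree with this. [[User:Carol|Carol]] 11:00, 1 June 2019 (UTC)
# Per [[User:Alice]]. [[User:Bob Smith|Bob]] 12:00, 1 June 2019 (UTC)
# Unsigned comment
"""

print(extract_voters(sample))
```

The text of a single section (for example the Support section) could be passed in as str(section) after using the workaround above. Real signatures vary a lot (custom markup, templates, links only to contributions pages), so this will likely need adjusting for actual RfA pages.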