WiktionaryParser
Explore MediaWiki parsers instead of parsing HTML directly
Instead of parsing the HTML, use an existing MediaWiki parser (such as mwparserfromhell) as a second stage, since headings, content, tags, comments, etc. are clearly defined in wikitext and the wikitext is more compact.
Hello Suyash, I may like to work on this if I have the time. Some questions:
- Are you aiming to use mwparserfromhell to read the HTML content parsed by BeautifulSoup, or to read the wikitext markup in Wiktionary XML dump files?
- What are the prerequisites for you accepting a pull request? (as I see some pull requests haven't been merged)
Hi @sehwol, thank you for your interest in this! I was planning to use mwparserfromhell to parse the wikitext directly instead of the HTML, mainly for the following reasons:
- I've seen a lot of pages that have the same kind of content in different HTML tags or structures, so I was hoping this would be more resilient. Since we won't be dealing with HTML, we won't have to clean it up like we're doing here.
- If the parser works on wikitext, it'll be easy to make it work with wikitext dumps later on instead of making HTTP calls.
The wikitext can be retrieved using Wiktionary's API https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2&format=json
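As a rough illustration of the API call above, here is a minimal Python sketch using only the standard library. The function names (`build_wikitext_url`, `extract_wikitext`, `fetch_wikitext`) and the User-Agent string are hypothetical and not part of WiktionaryParser. It assumes the `formatversion=2` response carries the page source as a plain string under `parse.wikitext`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wiktionary.org/w/api.php"


def build_wikitext_url(page):
    """Build the action=parse URL that asks for a page's raw wikitext."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "formatversion": 2,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_wikitext(payload):
    """Pull the wikitext string out of a decoded API response.

    Assumes the formatversion=2 layout: {"parse": {"wikitext": "..."}}.
    """
    if "error" in payload:
        # The API reports problems (e.g. a missing page) under "error".
        raise ValueError(payload["error"].get("info", "unknown API error"))
    return payload["parse"]["wikitext"]


def fetch_wikitext(page, timeout=10):
    """Fetch and return the wikitext for one page (performs an HTTP call)."""
    # Placeholder User-Agent; replace it with one that identifies your client.
    request = Request(
        build_wikitext_url(page),
        headers={"User-Agent": "wikitext-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return extract_wikitext(json.load(response))
```

The string returned by `fetch_wikitext("test")` could then be handed to `mwparserfromhell.parse(...)` as the second stage. Keeping URL building and response handling separate from the network call means the same `extract_wikitext` step could later be reused for text read from a dump.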
I'll accept a PR if the tests work and the code looks good to me. The two pending ones aren't really complete so I haven't merged them yet.
Hi Suyash, do the tests all run on your computer?
If I fetch words like "video" (Latin, oldid 50291344), I sometimes get output like this:
...
"text": [
"Lua error in Module:la-verb at line 747: The parameter \"conj\" is not used by this template.",
"I see, perceive; look (at)",
...
Source: https://en.wiktionary.org/wiki/video?printable=yes&oldid=50291344#Verb_2
I'm not sure if Wiktionary just developed a bug or if it's something else.
Edit:
I've started splitting up the tests and adding a bit more logging so people can tell which word and language specifically is failing a test. This does mean that I'm adding parameterized==0.7.4 as a dependency.
Tbh I haven't worked on this project in a while, but I'll take a look at the tests right away. The exception looks like an error on Wiktionary's end that turns up when the wikitext is rendered. Adding parameterized==0.7.4 sounds like a great idea 😊. Do you mind creating a separate issue for this and submitting a PR once you've fixed the tests?
I think this is one of the most comprehensive parsers which does that: https://github.com/tatuylonen/wiktextract
@frankier this looks very promising, thanks for pointing it out!