WiktionaryParser

explore mediawiki parsers instead of parsing HTML directly

Open suyashb95 opened this issue 5 years ago • 6 comments

Instead of parsing the HTML, use an existing MediaWiki parser (like mwparserfromhell) as a second stage, since headings, content, tags, comments, etc. are clearly delimited in wikitext and the wikitext is more compact than the rendered HTML.

suyashb95 avatar Jul 09 '20 16:07 suyashb95

Hello Suyash, I'd like to work on this if I have the time. Some questions:

  • Are you aiming to use mwparserfromhell to read the HTML content parsed by BeautifulSoup, or to read the wikitext in Wiktionary XML dump files?
  • What are the prerequisites for you accepting a pull request? (as I see some pull requests haven't been merged)

ghost avatar Sep 03 '20 17:09 ghost

Hi @sehwol, thank you for your interest in this! I was planning to use mwparserfromhell to parse the wikitext directly instead of the HTML, mainly for the following reasons:

  • I've seen a lot of pages with the same kind of content in different HTML tags or structures, so I was hoping that this would be more resilient. Since we won't be dealing with HTML, we won't have to clean it up like we're doing here

  • If the parser works on wikitext, it'll be easy to make it work with wikitext dumps later on instead of making HTTP calls

The wikitext can be retrieved using Wiktionary's API: https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2&format=json
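A rough sketch of that call using only the standard library. The function names, the User-Agent string, and the timeout are assumptions made for illustration. With `formatversion=2` the wikitext comes back as a plain string under `parse.wikitext`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wiktionary.org/w/api.php"


def build_wikitext_url(page):
    """Build the same query as the URL above for an arbitrary page title."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "formatversion": "2",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_wikitext(payload):
    """Return the wikitext string from a decoded API response."""
    if "error" in payload:
        # The API reports problems (e.g. a missing page) in an "error" object.
        raise LookupError(payload["error"].get("info", "unknown API error"))
    return payload["parse"]["wikitext"]


def fetch_wikitext(page, timeout=10):
    """Fetch a page's wikitext over HTTP (needs network access)."""
    request = Request(
        build_wikitext_url(page),
        # Illustrative User-Agent; pick one that identifies your own tool.
        headers={"User-Agent": "wikitext-example/0.1 (illustrative sketch)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return extract_wikitext(json.load(response))
```

Keeping `extract_wikitext` separate from the HTTP call means the same parsing code could later be fed wikitext from a dump file instead.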

I'll accept a PR if the tests work and the code looks good to me. The two pending ones aren't really complete so I haven't merged them yet.

suyashb95 avatar Sep 04 '20 04:09 suyashb95

Hi Suyash, do the tests all run on your computer?

When I fetch words like "video" (Latin, oldid 50291344), I sometimes get output like this:

...
                    "text": [
                        "Lua error in Module:la-verb at line 747: The parameter \"conj\" is not used by this template.",
                        "I see, perceive; look (at)",
...

Source: https://en.wiktionary.org/wiki/video?printable=yes&oldid=50291344#Verb_2

I'm not sure if Wiktionary just developed a bug or if it's something else.

Edit: I've started splitting up the tests and adding a bit more logging so people can tell which word and language specifically is failing a test. This does mean that I'm adding parameterized==0.7.4 as a dependency.

split-tests

ghost avatar Sep 04 '20 11:09 ghost

Tbh I haven't worked on this project in a while, but I'll take a look at the tests right away. The exception looks like an error on Wiktionary's end that turns up when the wikitext is rendered. Adding parameterized==0.7.4 sounds like a great idea 😊. Do you mind creating a separate issue for this and submitting a PR once you've fixed the tests?

suyashb95 avatar Sep 04 '20 15:09 suyashb95

I think this is one of the most comprehensive parsers which does that: https://github.com/tatuylonen/wiktextract

frankier avatar Mar 11 '21 09:03 frankier

@frankier this looks very promising, thanks for pointing it out!

suyashb95 avatar Mar 11 '21 11:03 suyashb95