wiktextract

[el] Parsing error in headers

Open daxida opened this issue 3 months ago • 7 comments

When parsing header content, tag-related information ends up among the forms of some entries. For example, αιδώς:

wiktwords --db-path tmp/el_latest.db --edition el --language-code el --out outfile --page αιδώς
2025-10-13 08:49:50,247 INFO: Capturing words for: el
cat outfile | jq -c '{forms}' --indent 2
{
  "forms": [
    {
      "form": "αιδώς",
      "raw_tags": [
        "θηλυκό",
        "μόνο στον ενικό"
      ]
    },
    {
      "form": "λόγιο", // <-------- λόγιο := literary, this should not be here
      "raw_tags": [
        "θηλυκό",
        "μόνο στον ενικό"
      ]
    }
  ]
}

Parsing that information into a tag would be nice, but the most important thing is probably to stop emitting it as a form.
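
To get a rough idea of how often a label leaks into the forms, the output file can be scanned for entries whose forms contain a known label. This is a minimal sketch, assuming the output is one JSON object per line with "word" and "forms" keys as in the sample above; count_label_forms and SUSPECT_LABELS are hypothetical names, and the label set only contains the one case reported here.

```python
import json
from collections import Counter
from typing import Iterable

# Assumption: labels that should never be forms; only λόγιο is known so far.
SUSPECT_LABELS = {"λόγιο"}


def count_label_forms(lines: Iterable[str]) -> Counter:
    """Count forms that are really labels, one JSON entry per input line."""
    hits: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for form in entry.get("forms", []):
            text = form.get("form")
            # Skip the page for the label word itself (e.g. the entry λόγιο).
            if text in SUSPECT_LABELS and entry.get("word") != text:
                hits[text] += 1
    return hits
```

Any iterable of lines works as input, so an open file handle for the outfile produced by wiktwords can be passed directly.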


How frequently does this appear?

No idea. I used this diagnostic snippet in extractor/el/head.py:

            case ")":
                inside_parens = False
                # print(f"{current_forms=}, {current_tags=}, {t=}")
                if (
                    not current_forms
                    and len(current_tags) == 1
                    and code_to_name(current_tags[0]) != ""
                ):
                    # There are a lot of `(en)` language code tags that we
                    # don't care about because they're just repeating the
                    # language code of the word entry itself!
                    current_tags = []
                    continue

                # --------- ADDED THIS
                if not current_tags:
                    if current_forms == ["λόγιο"]:
                        # dirty hack: to remove λόγιο from forms
                        # TESTED: it does not affect λόγιο itself
                        current_forms = []
                        continue
                    with open("parens_cases.txt", "a") as f:
                        w = wxr.wtp.title
                        f.write(f"{w} -- {current_forms}\n")
                # ---------

                if current_forms and current_tags:
                    push_new_block()
                else:
                    extend_old_block()

The "parens_cases.txt" file contains other examples of λόγιο, and it also reveals more problems. For example, γάιδαρος:

{
  "forms": [
    {
      "form": "γάιδαρος",
      "raw_tags": [
        "αρσενικό", // <---- masculine OK
        "θηλυκό"    // <---- feminine NOK
      ]
    },
    {
      "form": "γαϊδάρα",
      "raw_tags": [
        "αρσενικό", // <---- masculine NOK
        "θηλυκό"    // <---- feminine OK
      ]
    },
    {
      "form": "&nbsp;", // <--- should not be here
      "raw_tags": [
        "αρσενικό",
        "θηλυκό"
      ]
    },
    {
      "form": "γαϊδούρα",
      "raw_tags": [
        "αρσενικό", // <---- masculine NOK
        "θηλυκό"    // <---- feminine OK
      ]
    }
  ]
}
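
As a stopgap for the "&nbsp;" entry, forms that are blank after HTML-entity decoding could be dropped in post-processing. This is only a sketch of a workaround on the output side, not a fix for the parser; drop_blank_forms is a hypothetical helper and does not exist in wiktextract.

```python
import html


def drop_blank_forms(forms: list[dict]) -> list[dict]:
    """Remove forms whose text is empty once entities are decoded."""
    kept = []
    for entry in forms:
        # "&nbsp;" decodes to U+00A0, which str.strip() treats as whitespace.
        text = html.unescape(entry.get("form", "")).strip()
        if text:
            kept.append(entry)
    return kept
```

This leaves the wrong raw_tags untouched; it only removes the entry that has no real form text.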

While trying to fix the λόγιο case, I wrote this test in test_el_head.py; it may help when iterating on a solution:

Note: it requires from wiktextract.extractor.el.page import parse_page at the top of the file.

    def test_parsing_logio(self) -> None:
        # https://el.wiktionary.org/wiki/αιδώς
        # Test that logio (literary) is correctly parsed
        self.wxr.wtp.add_page("Πρότυπο:-el-", 10, "Greek")
        self.wxr.wtp.add_page("Πρότυπο:ουσιαστικό", 10, "Ουσιαστικό")
        self.wxr.wtp.add_page(
            "Πρότυπο:ετ",
            10,
            """([[:Κατηγορία:Λόγιοι όροι (νέα ελληνικά)|<i>λόγιο</i>]])[[Κατηγορία:Λόγιοι όροι  (νέα ελληνικά)]]""",
        )
        self.wxr.wtp.add_page(
            "Πρότυπο:θεν",
            10,
            """<span style="background:#ffffff; color:#002000;">''θηλυκό, μόνο στον ενικό''</span>""",
        )
        self.wxr.wtp.add_page(
            "Πρότυπο:κλείδα-ελλ",
            10,
            """[[Κατηγορία:Αντίστροφο λεξικό (ελληνικά)|σωδια]]""",
        )

        raw = """=={{-el-}}==
==={{ουσιαστικό|el}}===
'''{{PAGENAME}}''' {{θεν}} {{ετ|λόγιο}}
"""
        word = "αιδώς"
        page_datas = parse_page(self.wxr, word, raw)
        received = page_datas[0]["forms"]

        expected = [
            {"form": "αιδώς", "raw_tags": ["θηλυκό", "μόνο στον ενικό"]},
        ]

        self.assertEqual(received, expected)

daxida avatar Oct 13 '25 07:10 daxida

The "λόγιο" form most likely appears because it's a link, but the code doesn't check what it's linking to, in this case a category page. That should be an easy fix.
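
A check of that kind might look roughly like the following. This is a sketch under assumptions, not the code that ended up in the extractor: is_category_link and CATEGORY_NAMESPACES are hypothetical names, and only the Greek and English namespace names are covered.

```python
# Hypothetical helper for illustration; not part of wiktextract's API.
CATEGORY_NAMESPACES = ("κατηγορία", "category")


def is_category_link(target: str) -> bool:
    """Return True if a wikilink target points at a category page."""
    # A leading colon (":Κατηγορία:...") renders a visible link to the
    # category page instead of categorising the current page.
    cleaned = target.strip().lstrip(":")
    namespace, sep, _ = cleaned.partition(":")
    return bool(sep) and namespace.strip().lower() in CATEGORY_NAMESPACES
```

With such a check, the link text of [[:Κατηγορία:Λόγιοι όροι (νέα ελληνικά)|λόγιο]] could be skipped as a form candidate, while an ordinary link like [[γαϊδάρα]] would still be accepted.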

kristian-clausal avatar Oct 13 '25 07:10 kristian-clausal

The issue with Template:et (except in Greek; it's a pain to keep copy-pasting it, you know what I mean...) should be handled here. :Category: links should now be handled, as should HTML italics and bold tags. I also changed the logic for accepting forms a bit: if there is bold text in the head, italics in links are no longer accepted as a form.

kristian-clausal avatar Oct 13 '25 08:10 kristian-clausal

Thank you.

I skimmed over the PR, and while I can't say I understand everything that is going on, the original issue seems fixed.

LGTM.


The γάιδαρος case above is still wrong. I made a test for it but I did not delve deeper into what was causing the problem. Do you want me to post it here or on a separate issue?

daxida avatar Oct 13 '25 09:10 daxida

Please post the test, I'll copy-paste it, or make a PR and I'll pull it into a separate branch.

kristian-clausal avatar Oct 13 '25 09:10 kristian-clausal

Thank you again.

I had it in test_el_head.py

    def test_parsing_forms_and_tags(self) -> None:
        # https://el.wiktionary.org/wiki/γάιδαρος
        self.wxr.wtp.add_page("Πρότυπο:-el-", 10, "Greek")
        self.wxr.wtp.add_page("Πρότυπο:ουσιαστικό", 10, "Ουσιαστικό")
        self.wxr.wtp.add_page(
            "Πρότυπο:α",
            10,
            """<span style="background:#ffffff; color:#002000;">''αρσενικό''</span>""",
        )
        self.wxr.wtp.add_page(
            "Πρότυπο:θ",
            10,
            """(<span style="background:#ffffff; color:#002000;">''θηλυκό''</span> '''[[γαϊδάρα]]'''&nbsp;''ή''  '''[[γαϊδούρα]]''')""",
        )

        raw = """=={{-el-}}==
==={{ουσιαστικό|el}}===
'''{{PAGENAME}}''' {{α}} {{θ|γαϊδάρα|ή=γαϊδούρα}}
"""
        word = "γάιδαρος"
        page_datas = parse_page(self.wxr, word, raw)
        received = page_datas[0]["forms"]

        expected = [
            {"form": "γάιδαρος", "raw_tags": ["αρσενικό"]},
            {"form": "γαϊδάρα", "raw_tags": ["θηλυκό"]},
            {"form": "γαϊδούρα", "raw_tags": ["θηλυκό"]},
        ]

        self.assertEqual(received, expected)

EDIT: there was a mistake in raw_tags

daxida avatar Oct 13 '25 09:10 daxida

The gaidaros errors are related to the logic in parse_head; I'll take a deeper look tomorrow.

kristian-clausal avatar Oct 13 '25 11:10 kristian-clausal

Changed the logic for parenthesised items: when we come across the start of a bold item and the current 'buffer' holds both tags and forms, we start a new item, unless we're inside a parenthesis. Otherwise we just add the tags, the forms, or nothing to the previous entry. At some point this needs to be rewritten with the current tests in mind, but that day isn't today.
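
The buffering rule can be illustrated with a toy model. This is a loose sketch and not the real parse_head: it assumes the head line has already been reduced to (kind, text) tokens, that bold text is a form and italic text a tag, and that the connector "ή" ("or") is ignored. group_head_tokens, Token and CONNECTORS are hypothetical names.

```python
from typing import Iterable

Token = tuple[str, str]  # (kind, text); kind is "bold", "italic", "(" or ")"
CONNECTORS = {"ή"}  # assumption: "or" joins forms and is not a tag


def group_head_tokens(tokens: Iterable[Token]) -> list[dict]:
    """Group bold forms and italic tags into form items."""
    items: list[dict] = []
    forms: list[str] = []
    tags: list[str] = []
    inside_parens = False

    def flush() -> None:
        nonlocal forms, tags
        if forms:
            # Every buffered form becomes an item sharing the buffered tags.
            items.extend({"form": f, "raw_tags": list(tags)} for f in forms)
        elif tags and items:
            # Tags without forms are attached to the previous item.
            items[-1]["raw_tags"].extend(tags)
        forms, tags = [], []

    for kind, text in tokens:
        if kind == "(":
            flush()
            inside_parens = True
        elif kind == ")":
            flush()
            inside_parens = False
        elif kind == "bold":
            # A new bold item starts a new block only outside parentheses.
            if forms and tags and not inside_parens:
                flush()
            forms.append(text)
        elif kind == "italic" and text not in CONNECTORS:
            tags.append(text)
    flush()
    return items
```

Fed the tokens of the γάιδαρος head (bold γάιδαρος, italic αρσενικό, then the parenthesis with italic θηλυκό, bold γαϊδάρα, italic ή, bold γαϊδούρα), this toy grouping keeps the two feminine forms together inside the parenthesis so that they share the θηλυκό tag and γάιδαρος keeps only αρσενικό.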

kristian-clausal avatar Oct 14 '25 07:10 kristian-clausal