[el] Parsing error in headers
When parsing head-line content, some entries end up with tag-related text emitted as a form. For example, αιδώς:
wiktwords --db-path tmp/el_latest.db --edition el --language-code el --out outfile --page αιδώς
2025-10-13 08:49:50,247 INFO: Capturing words for: el
cat outfile | jq -c '{forms}' --indent 2
{
  "forms": [
    {
      "form": "αιδώς",
      "raw_tags": [
        "θηλυκό",
        "μόνο στον ενικό"
      ]
    },
    {
      "form": "λόγιο", // <-------- λόγιο := literary, this should not be here
      "raw_tags": [
        "θηλυκό",
        "μόνο στον ενικό"
      ]
    }
  ]
}
Parsing that information into a tag would be nice, but the most important thing is probably to stop emitting it as a form.
How frequently does this appear?
No idea. I had this snippet for diagnosing in extractor/el/head.py:
case ")":
    inside_parens = False
    # print(f"{current_forms=}, {current_tags=}, {t=}")
    if (
        not current_forms
        and len(current_tags) == 1
        and code_to_name(current_tags[0]) != ""
    ):
        # There are a lot of `(en)` language code tags that we
        # don't care about because they're just repeating the
        # language code of the word entry itself!
        current_tags = []
        continue
    # --------- ADDED THIS
    if not current_tags:
        if current_forms == ["λόγιο"]:
            # dirty hack: to remove λόγιο from forms
            # TESTED: it does not affect λόγιο itself
            current_forms = []
            continue
        with open("parens_cases.txt", "a") as f:
            w = wxr.wtp.title
            f.write(f"{w} -- {current_forms}\n")
    # ---------
    if current_forms and current_tags:
        push_new_block()
    else:
        extend_old_block()
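To get a rough answer to the frequency question, the log written by the snippet above could be tallied. This is only a sketch: `tally_paren_forms` is a hypothetical helper, not part of wiktextract, and it assumes every line has the `title -- [forms]` shape produced by the `f.write` call and that titles do not themselves contain " -- ".

```python
import ast
from collections import Counter
from typing import Iterable


def tally_paren_forms(lines: Iterable[str]) -> Counter:
    """Count how often each form appears in parens_cases.txt-style lines.

    Hypothetical helper. Assumes each line looks like
    "<page title> -- ['form1', 'form2']", i.e. the repr of a list of strings.
    """
    counts: Counter = Counter()
    for line in lines:
        _title, sep, forms_repr = line.rstrip("\n").partition(" -- ")
        if not sep:
            continue  # not a log line in the expected shape
        try:
            forms = ast.literal_eval(forms_repr)
        except (ValueError, SyntaxError):
            continue  # right-hand side is not a Python literal
        if not isinstance(forms, list):
            continue
        for form in forms:
            # Forms are counted raw, so whitespace-only forms stay visible.
            counts[str(form)] += 1
    return counts


# Possible usage once the log exists:
#     with open("parens_cases.txt", encoding="utf-8") as f:
#         print(tally_paren_forms(f).most_common(20))
```

Counting raw strings rather than stripped ones is deliberate: a form such as " " then shows up as its own entry instead of being hidden.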
The "parens_cases.txt" file contains other examples of λόγιο, and it also reveals more problems. For example, γάιδαρος:
{
  "forms": [
    {
      "form": "γάιδαρος",
      "raw_tags": [
        "αρσενικό", // <---- masculine OK
        "θηλυκό" // <---- feminine NOK
      ]
    },
    {
      "form": "γαϊδάρα",
      "raw_tags": [
        "αρσενικό", // <---- masculine NOK
        "θηλυκό" // <---- feminine OK
      ]
    },
    {
      "form": " ", // <--- should not be here
      "raw_tags": [
        "αρσενικό",
        "θηλυκό"
      ]
    },
    {
      "form": "γαϊδούρα",
      "raw_tags": [
        "αρσενικό", // <---- masculine NOK
        "θηλυκό" // <---- feminine OK
      ]
    }
  ]
}
When trying to fix the λόγιο case, I wrote this test in test_el_head.py; maybe it can help when iterating on a solution.
Note: it requires "from wiktextract.extractor.el.page import parse_page" at the top of the file.
    def test_parsing_logio(self) -> None:
        # https://el.wiktionary.org/wiki/αιδώς
        # Test that logio (literary) is correctly parsed
        self.wxr.wtp.add_page("Πρότυπο:-el-", 10, "Greek")
        self.wxr.wtp.add_page("Πρότυπο:ουσιαστικό", 10, "Ουσιαστικό")
        self.wxr.wtp.add_page(
            "Πρότυπο:ετ",
            10,
            """([[:Κατηγορία:Λόγιοι όροι (νέα ελληνικά)|<i>λόγιο</i>]])[[Κατηγορία:Λόγιοι όροι (νέα ελληνικά)]]""",
        )
        self.wxr.wtp.add_page(
            "Πρότυπο:θεν",
            10,
            """<span style="background:#ffffff; color:#002000;">''θηλυκό, μόνο στον ενικό''</span>""",
        )
        self.wxr.wtp.add_page(
            "Πρότυπο:κλείδα-ελλ",
            10,
            """[[Κατηγορία:Αντίστροφο λεξικό (ελληνικά)|σωδια]]""",
        )
        raw = """=={{-el-}}==
==={{ουσιαστικό|el}}===
'''{{PAGENAME}}''' {{θεν}} {{ετ|λόγιο}}
"""
        word = "αιδώς"
        page_datas = parse_page(self.wxr, word, raw)
        received = page_datas[0]["forms"]
        expected = [
            {"form": "αιδώς", "raw_tags": ["θηλυκό", "μόνο στον ενικό"]},
        ]
        self.assertEqual(received, expected)
The "λόγιο" form is most likely picked up because it's a link... But the code doesn't check what it's linking to, in this case a category page. That should be an easy fix.
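A check along those lines might look like the sketch below. `is_category_link` is a hypothetical helper written for illustration, not the code that ended up in wiktextract, and the set of namespace names is an assumption covering only the Greek and English spellings.

```python
# Namespace names treated as "category", compared case-insensitively.
# Assumption: only the Greek and English names matter for el.wiktionary.
CATEGORY_NAMESPACES = {"κατηγορία".casefold(), "category".casefold()}


def is_category_link(target: str) -> bool:
    """Return True if a wikilink target points at a category page.

    Handles both "[[:Κατηγορία:...]]" (a visible link to the category page)
    and "[[Κατηγορία:...]]" (a category assignment); neither is a word form.
    """
    stripped = target.strip()
    if stripped.startswith(":"):
        stripped = stripped[1:]
    namespace, sep, _rest = stripped.partition(":")
    return bool(sep) and namespace.strip().casefold() in CATEGORY_NAMESPACES
```

With such a check, the link text of `[[:Κατηγορία:Λόγιοι όροι (νέα ελληνικά)|<i>λόγιο</i>]]` could be skipped as a form candidate, while ordinary links such as `[[γαϊδάρα]]` would still be accepted.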
The issue with Template:et (except in Greek; it's a pain to keep copy-pasting it, you know what I mean...) should be handled here. :Category: links should be handled, as should HTML italics and bold tags. I also changed the logic for accepting forms a bit, so that if there is bold text in the head, italics in links are not accepted as a form.
Thank you.
I skimmed over the PR, and while I can't say I understand everything that is going on, the original issue seems fixed.
LGTM.
The γάιδαρος case above is still wrong. I made a test for it but I did not delve deeper into what was causing the problem. Do you want me to post it here or on a separate issue?
Please post the test, I'll copy-paste it, or make a PR and I'll pull it into a separate branch.
Thank you again.
I had it in test_el_head.py
    def test_parsing_forms_and_tags(self) -> None:
        # https://el.wiktionary.org/wiki/γάιδαρος
        self.wxr.wtp.add_page("Πρότυπο:-el-", 10, "Greek")
        self.wxr.wtp.add_page("Πρότυπο:ουσιαστικό", 10, "Ουσιαστικό")
        self.wxr.wtp.add_page(
            "Πρότυπο:α",
            10,
            """<span style="background:#ffffff; color:#002000;">''αρσενικό''</span>""",
        )
        self.wxr.wtp.add_page(
            "Πρότυπο:θ",
            10,
            """(<span style="background:#ffffff; color:#002000;">''θηλυκό''</span> '''[[γαϊδάρα]]''' ''ή'' '''[[γαϊδούρα]]''')""",
        )
        raw = """=={{-el-}}==
==={{ουσιαστικό|el}}===
'''{{PAGENAME}}''' {{α}} {{θ|γαϊδάρα|ή=γαϊδούρα}}
"""
        word = "γάιδαρος"
        page_datas = parse_page(self.wxr, word, raw)
        received = page_datas[0]["forms"]
        expected = [
            {"form": "γάιδαρος", "raw_tags": ["αρσενικό"]},
            {"form": "γαϊδάρα", "raw_tags": ["θηλυκό"]},
            {"form": "γαϊδούρα", "raw_tags": ["θηλυκό"]},
        ]
        self.assertEqual(received, expected)
EDIT: there was a mistake in raw_tags
The γάιδαρος errors are related to the logic in parse_head; I'll take a deeper look tomorrow.
Changed the logic for parenthesised items: if we come across the start of a bold item while the current 'buffer' holds both tags and forms, we start a new item, unless we're inside a parenthesis. Otherwise we just add the tags, the forms, or nothing to the previous entry. At some point this needs to be rewritten with the current tests in mind, but that day isn't today.
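For readers who want to play with the idea outside the extractor, here is a heavily simplified model of that grouping rule. It is not the parse_head code: the `(kind, text)` token shape, the `CONNECTORS` set, and `group_head_tokens` itself are all assumptions made up for this sketch, and flushing the buffer at "(" and ")" is my own simplification.

```python
from dataclasses import dataclass, field

# Italic words that join forms rather than tag them ("or", "and").
# Assumption for this sketch only.
CONNECTORS = {"ή", "και"}


@dataclass
class Block:
    forms: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


def group_head_tokens(tokens: list[tuple[str, str]]) -> list[dict]:
    """Group (kind, text) tokens into forms with their raw tags.

    kind is one of "bold", "italic", "(" or ")" (hypothetical token model).
    """
    blocks: list[Block] = []
    forms: list[str] = []
    tags: list[str] = []
    inside_parens = False

    def flush() -> None:
        nonlocal forms, tags
        if (forms and tags) or not blocks:
            # Both forms and tags (or nothing to extend): start a new item.
            if forms or tags:
                blocks.append(Block(forms, tags))
        else:
            # Only forms, only tags, or nothing: extend the previous item.
            blocks[-1].forms.extend(forms)
            blocks[-1].tags.extend(tags)
        forms, tags = [], []

    for kind, text in tokens:
        if kind == "(":
            flush()
            inside_parens = True
        elif kind == ")":
            flush()
            inside_parens = False
        elif kind == "bold":
            # A new bold item closes a complete buffer, except inside parens.
            if forms and tags and not inside_parens:
                flush()
            forms.append(text)
        elif kind == "italic" and text not in CONNECTORS:
            tags.append(text)
    flush()
    return [
        {"form": form, "raw_tags": list(block.tags)}
        for block in blocks
        for form in block.forms
    ]
```

Fed the γάιδαρος head as tokens (bold γάιδαρος, italic αρσενικό, then a parenthesis holding italic θηλυκό, bold γαϊδάρα, italic ή, bold γαϊδούρα), this model yields the three entries the test above expects, because both bold items inside the parenthesis stay in one buffer and share the θηλυκό tag.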