Quick question about tags
- Are tags/raw_tags expected to be sorted / unique, and if so, at which point of the program: only at the end, during the whole program, etc.?
- For non-en editions, are raw_tags expected to be removed once they are translated into tags?
The docs state that:
we move tags that could be converted to tags to "tags" or "topics" field.
I understand this to mean that raw_tags are expected to be removed once they are translated, but the snippet seems to delete every raw_tag. I would expect something like:
def translate_raw_tags(data: WordEntry):
    raw_tags = []
    for raw_tag in data.raw_tags:
        if raw_tag in TAGS:
            data.tags.append(TAGS[raw_tag])
        elif raw_tag in TOPICS:
            data.topics.append(TOPICS[raw_tag])
        else:
            raw_tags.append(raw_tag)
    data.raw_tags = raw_tags
Also, somewhat related: the type hint of this generic translate_raw_tags seems to be violated in a couple of languages that I've checked, where it is called with a Form list, a Sound, etc.
sound = Sound(ipa=ipa, raw_tags=raw_tags)
translate_raw_tags(sound)
I perfectly understand why this happens, but it remains a bit confusing at first sight.
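For illustration, one way the hint could describe what the function actually needs is a structural type. This is only a minimal sketch under assumptions: HasRawTags is a hypothetical name, the Sound and Form classes are simplified dataclass stand-ins rather than the real models, TAGS is a one-entry stand-in table, and topics handling is left out.

```python
from dataclasses import dataclass, field
from typing import Protocol

# Stand-in translation table (assumption); the real tables are per-edition.
TAGS: dict[str, str] = {"αρσενικό": "masculine"}


class HasRawTags(Protocol):
    """Hypothetical protocol: any object carrying tags and raw_tags lists."""

    tags: list[str]
    raw_tags: list[str]


@dataclass
class Sound:  # simplified stand-in, not the real model
    ipa: str = ""
    tags: list[str] = field(default_factory=list)
    raw_tags: list[str] = field(default_factory=list)


@dataclass
class Form:  # simplified stand-in, not the real model
    form: str = ""
    tags: list[str] = field(default_factory=list)
    raw_tags: list[str] = field(default_factory=list)


def translate_raw_tags(data: HasRawTags) -> None:
    """Move translatable raw tags into tags and keep the rest in raw_tags."""
    remaining = []
    for raw_tag in data.raw_tags:
        if raw_tag in TAGS:
            data.tags.append(TAGS[raw_tag])
        else:
            remaining.append(raw_tag)
    data.raw_tags = remaining


# Both stand-in classes satisfy the protocol structurally.
sound = Sound(ipa="/a/", raw_tags=["αρσενικό", "not-in-table"])
form = Form(form="x", raw_tags=["αρσενικό"])
translate_raw_tags(sound)
translate_raw_tags(form)
```

With a hint like this, a type checker would accept any object that has both list fields, so the call sites with Sound or Form would no longer contradict the signature.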
The ultimate list of tags / raw_tags should be sorted.
Keep untranslated raw_tags, so that later we can either create dictionary entries for them or write code that discards or translates each raw_tag as needed.
The type hints should be improved.
Also, Tatu wants tags to often be what are actually "tagsets" fields: lists of lists of strings, where each tagset is a list of tags that is an alternative to the other tagsets. That is often not implemented, but it should be.
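To make the "tagsets" shape concrete, here is a small sketch. The example form and its tags are made up for illustration, and flatten_tagsets is a hypothetical helper, not something from the codebase.

```python
Tagset = list[str]

# Illustrative only: a form that is either nominative plural or accusative plural.
tagsets: list[Tagset] = [
    ["nominative", "plural"],
    ["accusative", "plural"],
]


def flatten_tagsets(tagsets: list[Tagset]) -> list[str]:
    """Hypothetical helper: collapse alternatives into one flat, duplicate-free list."""
    seen: dict[str, None] = {}
    for tagset in tagsets:
        for tag in tagset:
            seen.setdefault(tag, None)  # dict keeps first-seen order
    return list(seen)


flat = flatten_tagsets(tagsets)
```

Note that the flat list no longer says which tags belong together, which is the information the list-of-lists shape is meant to keep.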
IMO tags usually don't need to be sorted, especially if they are added in an order related to the wikitext structure. For example, the tags of zh-pron start from the topmost list, and tags extracted from a table caption come before row/column headers. This is not a required order, though; the tags are just conveniently added while walking the wikitext parse tree. Also, I feel it's a bit unnecessary to sort the tags for these two specific cases.
A tags list shouldn't contain duplicate tags, but this is probably not checked in some places.
And Python typing is more like comments rn; I'm afraid that if we want to pass mypy, the code will be more difficult to read.
As long as tests pass, it's fine, so sorting after a non-deterministic operation like set -> list is probably the only thing that is really required.
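The two cases above (keep the wikitext order but drop duplicates, or sort after going through a set) could look roughly like this. Both helper names are hypothetical; this is a sketch, not code from the repo.

```python
def unique_in_order(tags: list[str]) -> list[str]:
    """Drop duplicates while keeping first-seen (wikitext) order."""
    return list(dict.fromkeys(tags))


def unique_sorted(tags: list[str]) -> list[str]:
    """Go through a set, then sort so the result is stable across runs.

    Iteration order of a set of strings can change between interpreter
    runs because of hash randomization, hence the sort.
    """
    return sorted(set(tags))


ordered = unique_in_order(["plural", "masculine", "plural"])
stable = unique_sorted(["plural", "masculine", "plural"])
```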
EDIT:
Because different type checkers might have different behavior, we suggest using mypy when working on wiktextract.
Thank you for your answer.
Another quick question: when considering nouns, where should tags about basic grammatical features ("masculine/feminine", "singular/plural" etc.) be stored?
This, sort of, continues this conversation
I see that English puts them in senses even though they may come from the header, e.g. γυναίκα.
It is my understanding that other editions, like Russian, may have these at the WordEntry level. This is based on indirect knowledge via this other repo, kaikki-to-yomitan. Feel free to correct me on that one.
At any rate, in Greek these features are stored as Form tags, which makes the above repo ignore them altogether. The way I see it, if I want to change that, I can either:
- Propagate upwards, from Form tags to WordEntry tags (Russian)
- Propagate downwards, from Form tags to Sense tags (English)
- Do not change anything here, but add logic to the other repo
I suppose that the English approach is the reference, but it feels so wasteful to jam those tags downwards when I cannot see how they could possibly differ from sense to sense.
There are words in Greek like όρος (2 words, one masc., the other neut.) or Ίκαρος (1 word, 2 different POS) that share a radical, but as far as I can tell there is never a variation in gender/number under a single POS.
In most non-en extractors, WordEntry.tags apply to all Sense lists under the POS section; they can come from the POS section itself (the POS_DATA dictionary) and from the headword line. Sense.tags come from the definition list and apply only to that single list.
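As a rough picture of that scoping, here is a sketch with simplified dataclass stand-ins for the models. The helper effective_sense_tags is hypothetical (something a consumer of the data might write), and the example entry is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:  # simplified stand-in
    glosses: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


@dataclass
class WordEntry:  # simplified stand-in
    word: str
    pos: str
    tags: list[str] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)


def effective_sense_tags(entry: WordEntry, sense: Sense) -> list[str]:
    """Entry-level tags apply to every sense; sense-level tags only to that sense."""
    return list(dict.fromkeys(entry.tags + sense.tags))


# Made-up entry: the gender sits on the entry, a usage label on one sense.
entry = WordEntry(
    word="example",
    pos="noun",
    tags=["masculine"],
    senses=[
        Sense(glosses=["first gloss"], tags=["figuratively"]),
        Sense(glosses=["second gloss"]),
    ],
)
first = effective_sense_tags(entry, entry.senses[0])
second = effective_sense_tags(entry, entry.senses[1])
```

The entry-level tag is stored once and still reaches both senses, without being copied into each Sense.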
I'm not familiar with el edition wikitext layout but this feels so strange to move Form.tags around. Form.tags should only have tags of an inflection form, like "plural" for form "books" in page "book", it has nothing to do with WordEntry or Sense.
I guess you mean the '''{{PAGENAME}}''' {{α}} in page "Ίκαρος" is extracted as a Form. This bold word should only be added to forms list if it's different than the page title and also add the "canonical" tag. Example page: zh edition Russian word книга. And the tag templates in headword line should be added to WordEntry if it applies to the base form word unless it's for other inflection forms.
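The headword rule described above could be sketched like this. The function name and the dict shape are assumptions made for the example, not the extractor's actual API.

```python
def headword_forms(page_title: str, headword: str) -> list[dict]:
    """Return a 'canonical' form only when the bold headword differs from the title."""
    if headword == page_title:
        return []
    return [{"form": headword, "tags": ["canonical"]}]


# The stressed headword carries a combining acute accent (U+0301),
# so it differs from the page title and becomes a canonical form.
stressed = headword_forms("книга", "кни\u0301га")
# Here the headword equals the page title, so no form is added.
same = headword_forms("Ίκαρος", "Ίκαρος")
```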
I'm not sure what the POS section refers to exactly for Greek. As far as I know, forms (that could potentially relate to WordEntry) can come from either the headword line, or the declension table.
I'm not familiar with el edition wikitext layout but this feels so strange to move Form.tags around. Form.tags should only have tags of an inflection form, like "plural" for form "books" in page "book", it has nothing to do with WordEntry or Sense.
Even if the two examples above contain the POS in the headword, most of the time the headword has no such information: γρήγορος. You can, however, infer from the table that it is masculine singular, since that is the only matching form. This is what I meant by moving forms around.
I guess you mean the '''{{PAGENAME}}''' {{α}} in page "Ίκαρος" is extracted as a Form. This bold word should only be added to forms list if it's different than the page title and also add the "canonical" tag. Example page: zh edition Russian word книга. And the tag templates in headword line should be added to WordEntry if it applies to the base form word unless it's for other inflection forms.
This again, I considered moving, but my wording may have been off. The current Greek extractor:
- Adds a Form even if it is equal to PAGENAME (I have not encountered the книга > кни́га case; PAGENAME seems to always be the canonical form)
- Does not populate WordEntry with the respective tags, i.e. for Ίκαρος, there is no masculine tag (nor the equivalent αρσενικό raw_tag) at the end, at the WordEntry level.
For reference, the Ίκαρος page in kaikki. I don't think I posted it before.
POS is part of speech, like "noun", "verb", etc. It's always taken from the section header, which is directly under language. "masculine singular" are just gender and number.
I'm away from the office today (we're moving things around in the archives... by which I mean pallets of boxes of books, to another location entirely. There's a lot of pallets...) so I'll take a look at this later.
In page "Ίκαρος", ==={{κύριο όνομα|el}}=== and ==={{ουσιαστικό|el}}=== are part-of-speech (POS) sections, and the nodes between a POS section and the definition list are headword line nodes.
If the tag is extracted from a table for the base form word I think you could add it to WordEntry.
If a tag is extracted from the headword line and it's for the base form word, I think it should also be added to WordEntry. This is implemented in some non-en editions because I think it makes sense. Until now I didn't realize that the en edition code adds the tag to the forms list if the word form differs from the page title and otherwise adds it to the senses list; IMO this is unnecessary...
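A sketch of the table case, assuming forms are plain dicts and using a hypothetical helper name; the table rows below are illustrative and not actual extractor output.

```python
def base_form_tags(page_title: str, forms: list[dict]) -> list[str]:
    """Collect tags from table forms whose text equals the page title."""
    tags: list[str] = []
    for form in forms:
        if form["form"] != page_title:
            continue  # an inflected form, so its tags stay on the Form
        for tag in form["tags"]:
            if tag not in tags:
                tags.append(tag)
    return tags


# Illustrative rows: only the first one matches the page title.
table_forms = [
    {"form": "γρήγορος", "tags": ["masculine", "singular"]},
    {"form": "γρήγορη", "tags": ["feminine", "singular"]},
]
entry_tags = base_form_tags("γρήγορος", table_forms)
```

The returned tags could then be appended to WordEntry.tags, while the Form entries themselves stay untouched.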
PAGENAME is a MediaWiki magic word and it's always the page title, but a page may use a template instead, like the page абаеведение.