Quick question about declension tables

Open daxida opened this issue 1 month ago • 3 comments

What is the official stance about declension tables containing articles?

I checked https://github.com/tatuylonen/wiktextract/issues/161 and the docs but I could not find a clear answer.

They are trimmed in the Greek edition, and from a quick look at the English edition, most declension tables there don't include articles (e.g. aqua, κόκκινο), so there is no problem there. The issue is with German, where the articles are still present and cause major problems when building dictionaries. Of course, this is not strictly a wiktextract issue, and you may well not care.

And if there is a definite answer, could it be added somewhere in the docs? I know it's supposed to be for creating extractors but I think it's a great resource for clearing up doubts about conventions. I would suggest doing the same with tags and raw_tags from https://github.com/tatuylonen/wiktextract/issues/1418 (should tags be sorted, not sorted, don't care etc.), but I guess if there is no consensus then it's tricky.

daxida, Dec 05 '25 07:12

I think pronoun and article data should be separated from the form words if they can be detected easily and reliably. For example, the German Wiktionary's "Deutsch Verb Übersicht" table template has a column of pronoun data, which is extracted to the "Form.pronouns" list. But most other table templates put both the article and the form word in the same column as plain text, so they are not separated.

xxyzz, Dec 08 '25 00:12

Article data from the "Deutsch Substantiv Übersicht" template is now extracted to Form.article: #1545

xxyzz, Dec 08 '25 02:12

If the articles are simple to detect and remove, they should be, unless there's some linguistic reason why they shouldn't -- for example, if the articles can be irregular, as in, you can't predict what the article's form is based on the other information found in the data. This goes roughly for other particles and clitics and such; affixes without spaces would be too bothersome to remove, so those can be left as is.

kristian-clausal, Dec 08 '25 05:12
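
For dictionary builders who hit this with tables that are not yet handled upstream, the rule of thumb above (strip articles only when they are simple to detect) could be approximated downstream. The following is a minimal sketch, not wiktextract code: the helper names `split_article` and `normalize_forms` are hypothetical, the article list is a hand-written assumption covering common German definite and indefinite articles, and the `article` key only mirrors the `Form.article` field mentioned in the thread without claiming to match the exact JSON output.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a hand-picked set of German articles; extend as needed.
GERMAN_ARTICLES = frozenset({
    "der", "die", "das", "des", "dem", "den",
    "ein", "eine", "einen", "einem", "einer", "eines",
})


def split_article(form: str) -> Tuple[Optional[str], str]:
    """Split a leading article off a form string such as "der Hund".

    Returns (article, word). If the first token is not a known article,
    or nothing follows it, returns (None, form) unchanged.
    """
    head, sep, rest = form.partition(" ")
    rest = rest.strip()
    if sep and rest and head.lower() in GERMAN_ARTICLES:
        return head, rest
    return None, form


def normalize_forms(forms: List[Dict]) -> List[Dict]:
    """Return copies of form entries with any leading article moved
    into a separate "article" key (hypothetical downstream layout)."""
    result = []
    for entry in forms:
        article, word = split_article(entry["form"])
        cleaned = dict(entry, form=word)  # copy; input is not mutated
        if article is not None:
            cleaned["article"] = article
        result.append(cleaned)
    return result
```

Only space-separated articles are handled here; affixes or clitics written without a space are left untouched, in line with the comment above.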