wiktextract
Citation templates being included in etymology_templates lists
In page.py, there is effort to avoid including citation templates in etymology_templates lists (see ignored_etymology_templates_re and its use in etym_post_template_fn()). However, I noticed there are quite a few citation templates being included in the output. Here is a non-exhaustive list of some I noticed:
archivedotorg
cite book
Cite book
cite journal
cite web
ISSN
lang
num
R2A
w
Since these templates generally are not in the wikitext itself, I understand they are being added by the recursive expansion of templates invoked by templates that are in the wikitext. Often though, the only citation templates that are in the wikitext are ones that are currently ignored, like R:* templates. So, while any R:* template won't itself be added to etymology_templates, all the templates invoked by it (that don't match ignored_etymology_templates_re) will be, which seems strange.
I would think there must be a way to disallow such behavior by tweaking the template_fns and post_template_fns passed to clean_node(), but I wasn't able to figure it out in the brief time I spent on it.
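To make the idea concrete, here is a minimal sketch of one way a pre/post callback pair could suppress everything expanded inside an ignored template. It is not wiktextract's actual code: the toy expander, the TOY_TEMPLATES table, the IGNORED_RE pattern, and the function name collect_etymology_templates are all hypothetical, and the real clean_node() callbacks take different arguments.

```python
import re

# Hypothetical stand-in for ignored_etymology_templates_re; the real
# pattern in page.py is not reproduced here.
IGNORED_RE = re.compile(r"^(R:.+|cite-.+)$")

# Toy template definitions: name -> names of the templates it invokes.
TOY_TEMPLATES = {
    "R:Example": ["cite-book"],
    "cite-book": ["ISSN", "lang"],
    "ISSN": [],
    "lang": [],
    "der": [],
}


def collect_etymology_templates(top_level_names):
    """Walk nested toy templates, skipping ignored ones and their children."""
    collected = []
    ignored_depth = 0  # > 0 while inside the expansion of an ignored template

    def template_fn(name):
        # Called before a template is expanded.
        nonlocal ignored_depth
        if ignored_depth > 0 or IGNORED_RE.match(name):
            ignored_depth += 1

    def post_template_fn(name):
        # Called after a template and everything it invoked is expanded.
        nonlocal ignored_depth
        if ignored_depth > 0:
            ignored_depth -= 1
        else:
            collected.append(name)

    def expand(name):
        template_fn(name)
        for child in TOY_TEMPLATES.get(name, []):
            expand(child)
        post_template_fn(name)

    for name in top_level_names:
        expand(name)
    return collected
```

With this counter, a template such as lang is still recorded when it appears directly in the etymology, but not when it is only reached through an ignored citation template.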
Of course, the easiest fix would be just to add the templates I listed to the ignored list. But there are surely others I missed that don't belong as well.
I've committed a temporary kludge for this issue, based on what Tatu recommended, but I'm unhappy with it. It should at least fix cases like "your mileage may vary", where the cite-journal template was broken up into subtemplates like ISSN, as you noted. As far as I understand, though, the intent of the code has been for non-etymology templates to be fed into a normal header expansion function that does something 'normal' with them, i.e. the citation should appear somewhere. But at least now we don't have garbage data in the etymologies.
Another issue is that the templates "Cite book" and "cite book" seem to be some kind of alias of "cite-book"! I have no idea if this is a redirect, but if you search for Template:Cite book it sends you to Template:cite-book, and this is definitely something we should handle somewhere, if only because |cite-| is part of the regex used to generate ignored_etymology_templates_re. It might be a redirect or some kind of string canonicalization going on.
{{w}} might also need special handling. It's just a shorthand template for Wikipedia links when the link text and article name are the same.
Tatu checked my code and fixed some big oversights with the recursion counting logic.
The "Cite book" and "cite book" templates are redirects to Template:cite-book, and the most sensible thing to do is just gather redirect data for them and then add them to ignored_etymology_templates.
Great, thanks for your work.
I did notice that the "cite" variations were redirects. You can see some here and here. Probably the simplest way to cover the redirects would be to make the regex slightly broader. I did this in #155.
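As an illustration of what "broader" could mean, here is a small sketch of a pattern that accepts either capitalization of the first letter and either a hyphen or a space after "cite". The pattern and the helper name is_citation_template are hypothetical; this is not the change made in #155 and not the actual contents of ignored_etymology_templates_re.

```python
import re

# Hypothetical pattern covering "cite-book", "cite book" and "Cite book".
# The real ignored_etymology_templates_re is built differently.
CITE_RE = re.compile(r"^[Cc]ite[- ].+$")


def is_citation_template(name: str) -> bool:
    """Return True if the template name looks like a cite-* variant."""
    return CITE_RE.match(name) is not None
```

The trade-off of a broader pattern is that it may also catch unrelated templates that happen to start with "cite", which an explicit list of redirects avoids.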
I'm going to try to make an exhaustive redirect search tomorrow to find all of these kinds of redirects, which will probably include more than just the Cite/cite thing. Might need to take a look at the compiled regex, too.
I've now committed a bunch of different ignored template names into ignored_etymology_templates, specifically redirects grepped from extract.json. Took me a while to find where the data is stored, but it is saved into the json output of wiktextract. Somehow that hasn't gotten mapped into the json mapping on the diagnostics page (probably something like redirects are not words and are skipped...).
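For anyone who wants to repeat the redirect search, here is a rough sketch of the idea. It assumes the JSON output has one object per line and that redirect entries look like {"title": "Template:Cite book", "redirect": "Template:cite-book"}; that layout, the IGNORED_RE pattern, and the function name redirects_to_ignored are assumptions for illustration, so check them against your own extract.json.

```python
import json
import re

# Hypothetical stand-in for the real ignore pattern.
IGNORED_RE = re.compile(r"^cite-.+$")
PREFIX = "Template:"


def redirects_to_ignored(lines):
    """Collect template names that redirect to an already-ignored template."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        title = entry.get("title")
        target = entry.get("redirect")
        if not title or not target:
            continue  # not a redirect entry (assumed layout)
        if not (title.startswith(PREFIX) and target.startswith(PREFIX)):
            continue  # only template-to-template redirects are of interest
        if IGNORED_RE.match(target[len(PREFIX):]):
            names.add(title[len(PREFIX):])
    return sorted(names)
```

The function takes any iterable of lines, so it can be fed an open file object directly.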
I only added redirects, so @jmviz, if you commit more template names like in the first post and make a pull request I will merge it asap.
Actually, now that I've added those template names it might be better to wait for an update on the site and then generate a list of template names used in etymology_templates, just in case. In retrospect, the issue (for me) was the splintering of templates into smaller ones, which shouldn't be a problem anymore; the remaining risk is that something breaks now that there are a few blacklisted names in ignored_etymology_templates. After the next update tomorrow I'll try to trawl through the json data (root > etymology_templates > name) to see how many different kinds of templates people put in the etymology section.
etymology_templates.txt
A quick and dirty list of all the different templates found in etymology sections (or at least in etymology_templates). Format is
template_name; example_word/Language × 10
and I think it is comprehensive.
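A list in roughly that format can be produced with a short script along these lines. This is only a sketch, not the script used for the attached file: it assumes one JSON object per line with "word", "lang" and "etymology_templates" keys, and the function name summarize_etymology_templates is made up for the example.

```python
import json
from collections import defaultdict


def summarize_etymology_templates(lines, max_examples=10):
    """Map each template name to up to max_examples word/Language labels."""
    examples = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        label = f'{entry.get("word", "?")}/{entry.get("lang", "?")}'
        # root > etymology_templates > name
        for tmpl in entry.get("etymology_templates", []):
            name = tmpl.get("name")
            if name is None:
                continue
            seen = examples[name]
            if len(seen) < max_examples and label not in seen:
                seen.append(label)
    # One output line per template: "template_name; example example ..."
    return [f"{name}; {' '.join(found)}" for name, found in sorted(examples.items())]
```

Sorting by name keeps the output stable between runs, which makes it easier to diff the list after each site update.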
The only template name really present that is mentioned in the first post is still {{lang}}, which, afaict, is being used correctly to indicate text in another language in at least "false friends"/English.
One problem might be wikiquote, wikipedia, wikidata and especially wikispecies sidebars. When an article starts with the Etymology section, sometimes these slip there when they shouldn't; but I think we use at least wikispecies data for something, so difficult to say whether these should be filtered in ignored_etymology_templates_re.
If you find something that seems to be similar to this issue, highlight it by starting a new thread; closing this as complete (mostly) for now.