French edition "tags" fields are lists - should be space-separated strings
The "tags" fields in the French edition extraction are lists of strings. They should be single strings with space-separated tags for consistency with other editions.
With debug messages added for invalid extracted JSON data, the extract.errors file is 127 GB! I think most of this is debug messages complaining about the "tags" field.
I added code in wiktextract to limit the number of errors, warnings, and debug messages collected to 100000 each. The huge errors file was crashing website generation, as it loads the whole errors file into memory and then uses Python's multiprocessing, which forks many instances of itself, and due to Python's reference-counting memory management the data actually gets copied many times.
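A minimal sketch of the kind of cap described above, for illustration only. The class and attribute names are hypothetical and are not taken from wiktextract; only the limit of 100000 comes from the comment.

```python
MAX_MESSAGES = 100_000  # per-category limit mentioned in the comment above


class CappedMessages:
    """Hypothetical collector that stops storing messages once a limit is hit."""

    def __init__(self, limit=MAX_MESSAGES):
        self.limit = limit
        self.messages = []
        self.dropped = 0  # how many messages were discarded after the cap

    def add(self, msg):
        if len(self.messages) < self.limit:
            self.messages.append(msg)
        else:
            # Count the overflow instead of storing it, so memory stays bounded.
            self.dropped += 1


# Small limit so the behaviour is easy to see.
errors = CappedMessages(limit=3)
for i in range(5):
    errors.add(f"invalid tags on page {i}")
```

Keeping a `dropped` counter means the final report can still say how many messages were suppressed.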
All extractors use a list of strings for the tags field, and I don't think there are space-separated fields in any extractor. So the check_tags() function always returns at line 200 and adds an error. There are also two undefined variables in the last if branch.
IMO the JSON-checking code could process the final JSONL file, so it won't impact performance. And most basic checks like data types and field names are already enforced by pydantic.
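As a rough sketch of that idea, a checker could stream the finished JSONL file one line at a time instead of running during extraction. The function name and the specific checks are assumptions made for illustration, not the project's actual check code.

```python
import io
import json


def iter_problems(jsonl):
    """Yield (line_number, message) for each problematic line of a JSONL stream."""
    for line_no, line in enumerate(jsonl, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as e:
            yield line_no, f"invalid JSON: {e.msg}"
            continue
        if not isinstance(entry, dict):
            yield line_no, "entry is not a JSON object"
        elif "word" not in entry:
            yield line_no, 'missing "word" field'


# In-memory stand-in for the final JSONL file.
sample = io.StringIO('{"word": "voir"}\n{"pos": "verb"}\nnot json\n')
problems = list(iter_problems(sample))
```

Because this is a generator over lines, only one entry is held in memory at a time, regardless of file size.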
check_json_data() runs in the main process, so I guess the error variables won't be copied after the fork? But the main problem is that check_tags adds an error for every page. We should also improve the code that saves errors, maybe by using the logging standard library?
What Tatu meant was that tags shouldn't be lists of strings with single tags, they should be lists of strings with multiple tags. The main unit of tags is a string of tags separated by space "like this example", and you only have separated strings when you need to create two alternative readings of tags; when there's an "or" in the text that creates two sets of tags that are incompatible, like "nominative plural or genitive singular" resulting in ["nominative plural", "genitive singular"]; it should not create ["nominative", "plural", "genitive", "singular"].
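To make the convention described in this comment concrete, here is a small sketch. The function name is hypothetical, and splitting on the literal word "or" is a simplification, not how the English extractor parses tags.

```python
def parse_tagsets(text):
    """Split a description into alternative tagsets on the word 'or'.

    Each returned string is one tagset: tags separated by single spaces.
    """
    alternatives = text.split(" or ")
    # Normalize whitespace inside each alternative and drop empty pieces.
    return [" ".join(alt.split()) for alt in alternatives if alt.strip()]


two_readings = parse_tagsets("nominative plural or genitive singular")
one_reading = parse_tagsets("infinitif passé")
```

Here `two_readings` holds two alternative tagsets and `one_reading` holds a single tagset with two tags in it.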
I don't remember the fr extractor having any code that splits tag text on whitespace; mostly the tags are extracted from templates or italic nodes. Could you provide some examples from the fr edition data?
I don't understand what you are asking.
I still don't understand what problem this issue describes. The original issue says tags should be str, and your last comment describes another problem, saying a phrase tag like nominative plural shouldn't be split into two tags.
But I don't recall the code splitting tags on whitespace, and it would take time for me to check all the code for tag fields. So it would be nice if you already know which page has this problem so I can fix it quicker.
In https://kaikki.org/frwiktionary/Fran%C3%A7ais/meaning/v/vo/voir.html
"tags": [
"Modes impersonnels",
"Infinitif",
"Passé"
]
should be "tags": ["impersonal infinitive past"], or "modes-impersonnels infinitif passé" if no translation of the tags is done.
This is extracted from the table headers in the "Conjugaison" pages: https://fr.wiktionary.org/wiki/Conjugaison:français/voir
The tags are not combined text like in the English edition's gloss text. I think the English extractor also doesn't combine table headers of the conjugation table.
I'm not sure how to deal with tags... I guess I have to rename all of them to a tags_np field as suggested at https://github.com/tatuylonen/wiktextract/issues/489#issuecomment-1963194049 then slowly translate some of the most common tags to the tags used in the English extractor. I'm not sure if you would agree with this plan.
The English extractor does combine table headers of conjugation tables to the words. If you have a table called "Nominatives", all its forms are tagged with "nominative etc".
It's difficult for me to guess the standard of the en extractor by looking at the extracted JSON data... I mean, from what I saw, the same voir page on English Wiktionary doesn't seem to combine row headers and column headers of the huge conjugation table. But I guess you mean the table name "Modes impersonnels" of the first table in page https://fr.wiktionary.org/wiki/Conjugaison:français/voir should come before the row and column headers, like ["Modes impersonnels Infinitif", "Modes impersonnels Présent"]?
No, "tags" should not contain singular strings of singular tags, the strings are collections of tags that are separated by spaces. Forget about the list, it's not important; the tags should be substrings separated by spaces inside the larger string. The list is there just because you sometimes have conflicting alternatives that shouldn't be grouped together, alternative sets of tags.
Ok, I took a look at the table (took a while to find it), and "modes impersonnels" was a misread on my part; of course it was just a header for other headers. In this case, it's not needed because "infinitif" implies it. The string inside tags should be "infinitif passé" or "infinitive past". Apologies for the mixup.
"infinitif" and "passé" are two tags that are combined into a single string (separated by spaces). The list will in most situations only contain one of these strings containing tags; if not, then they're alternative sets for the same form. In tables, this happens less often than with forms in heads, because each form gets its own entry in a table while forms in heads don't (although I guess you could separate them instead of having "tags" be a list).
Really, "tags" should be "tagsets", because it's a list of strings containing sets of tags... It's a bit too late to change now.
I literally have no idea how to deal with all this tags code now... I don't know what to do except change the field name to something else. I guess it's safe to say that if a non-tag string has exactly the same meaning as one of the strings defined here https://github.com/tatuylonen/wiktextract/blob/061781b13f152f09e028411a812586f7d1737526/src/wiktextract/tags.py#L5196 then I could add the string to tags?
I think I'd better make sure the values in the tags field are tags first, then consider whether to combine them or not.
The code for parsing tags (what is a tag, what is it grouped with, etc.) for the English extractor is ridiculously complex, yeah.
We do not want separate tag strings, always combine for now. Combine by default, create alternative tagset strings only when needed. If you're already generating a tags field for a sense, just combine the tags into one string, separating with spaces; tags that are made out of several words (or generated out of several words, which can sometimes happen even with English tags, especially for locations etc.) have their whitespace replaced with - dashes.
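A sketch of the combining rule as this comment states it: one string per tagset, tags separated by spaces, and whitespace inside a multi-word tag replaced with dashes. The function name is hypothetical and this is not code from the English extractor.

```python
def combine_tags(tags):
    """Join tags into one space-separated tagset string.

    Multi-word tags have their internal whitespace replaced with '-',
    so that a space always separates two distinct tags.
    """
    parts = []
    for tag in tags:
        words = tag.split()
        if words:  # skip empty or whitespace-only strings
            parts.append("-".join(words))
    return " ".join(parts)


simple = combine_tags(["infinitif", "passé"])
multiword = combine_tags(["New Zealand", "rare"])
```

The dash replacement is what keeps the string unambiguous: splitting the result on spaces gives back exactly one item per original tag.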
I'm more confused now... I need to know under what conditions and which tags should be combined, and in which order. Because I still don't know the rules and the tags are not considered tags, if I combine them now in the wrong way I'll have more dead code.
Could you point me to the code in the en extractor that combines tags?
I guess the tags you want to combine are mostly the header tags (defined in xlat_head_map; gender and tense?) extracted from the header line (between the POS title and the gloss list), like "masculine plural" in the en page voyant (wrong example, the tags are not combined), but not tags in the gloss text like "(Littérature) (Rare)" in the fr page autrice, or the tags defined in valid_tags? And because the similar tags in the fr edition are in the row and column headers of the conjugaison table (or other inflection tables), the code should combine the table headers, but I guess combining them in all situations is also incorrect because inflection tables have different layouts and contents.
I think I should first change all tags fields to raw_tags as you suggested in issue 489, even though changing the field name is kind of painful.
All tags are combined into one tag set. The order doesn't matter. Tags are independent of each other. They are just all clumped together, added to the set of tags when encountered, and only in special cases do we have separate tagsets for one form or whatever, for cases where you have alternatives like "nominative plural or genitive singular" where you have a form that means either nominative plural or genitive singular and need to show that these two tagsets are separate and don't belong together.
In a table, each cell has tags that it gets from the table's title or main header (like say if you have a whole table for just X forms, "x" is one of the tags in the tagset), then from its headers when applicable (so I was wrong with "modes impersonnels", because that was a header for other headers, not a header applicable to a form) like "nominative" or "plural", and possibly tags from inside the cell itself (you can have footnotes and references etc. inside the cell itself...).
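The three sources of tags for a table cell can be sketched as follows. The function and parameter names are made up for illustration, and the real English table parser is far more involved than this.

```python
def cell_tags(title_tags, row_header_tags, column_header_tags, cell_own_tags=()):
    """Collect one tagset for a table cell from its three kinds of sources.

    Tags are added in the order encountered and duplicates are dropped;
    the order carries no meaning.
    """
    collected = []
    for source in (title_tags, row_header_tags, column_header_tags, cell_own_tags):
        for tag in source:
            if tag not in collected:
                collected.append(tag)
    return collected


# Hypothetical cell in a table titled "X forms", row "nominative", column "plural",
# with a footnote inside the cell marking the form as rare.
tags = cell_tags(["x"], ["nominative"], ["plural"], ["rare", "plural"])
```

A header that only groups other headers, like "Modes impersonnels" in the voir table, would simply not be passed in as a source.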
In voyant, we have voyant (feminine voyante, masculine plural voyants, feminine plural voyantes) as a head; "voyant" is already in "word", and because it doesn't look different from the title of the page voyant it doesn't get a separate form. If it was different (because of encoding issues or something) then it would get a form and a "canonical" tag.
"voyante", "voyants", "voyantes" get their own "forms" fields; "voyante" gets the "tags" field ["feminine"], "voyants" ["masculine plural"] (or "plural masculine", doesn't matter), "voyantes" ["feminine plural"].
Let's say there was a form voyanteX, an irregular form that means either "ergative plural" or "accusative dual". In this case, it would have a "tags" field ["ergative plural", "accusative dual"], which would have to be parsed from its context somehow; there's a ton of code for this in the English tag parsing code, and it's a big hassle, a real pain.
"raw_tags" seems good, thanks for that.
In other words, don't worry about figuring out what tags belong together just yet; that's a big, big parsing job and can be left alone. You can just dump all the tags for a form into one tagset and it's better than nothing.
I think I'd better just focus on form-of and alt-of tags...
Looks like I was wrong here and the "tags" field really is a list of strings now (it was once a space-separated string). One of the reasons for it being a single string was that there are tens of millions (if not hundreds of millions) of tags fields in the English data alone, and lists take significantly more memory than strings, so the lists add up to gigabytes of extra memory required to load the data into memory.
But, having the tags fields as lists is clearly the cleaner solution, so let's keep them as lists.
My original idea was that there should be a core set of grammatical/semantic/typographic tags that would be the same for all languages. For example, "transitive", "intransitive", "infinitive". Generally, these tags should be listed in valid_tags, though currently valid_tags also probably contains some tags that don't really belong there.
There can additionally be other tags that are language-specific. An example is tags for regional dialects or regionally used words. For English, these are in uppercase_tags. Other languages would have their own.
Writing a translation dictionary that translates the common grammatical tags from language-specific tags to those in valid_tags would not be a major effort, perhaps a day of work for each edition. Dealing with the dialect/region tags is much more work. However, the tags fields are not intended as a dumping place for all junk; there should not be 100000+ tags in a language/edition without a very good reason.
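A sketch of what such a translation dictionary could look like for the French edition. The mapping entries and function name are illustrative assumptions, the English values are assumed (not verified) to be in valid_tags, and anything unmapped is kept aside under raw_tags, the field name agreed on earlier in this thread.

```python
# Illustrative mapping only; a real table would be checked against valid_tags.
FR_TO_CORE = {
    "infinitif": "infinitive",
    "passé": "past",
    "présent": "present",
    "masculin": "masculine",
    "féminin": "feminine",
    "singulier": "singular",
    "pluriel": "plural",
}


def translate_tags(raw_tags):
    """Split raw edition-specific tags into (translated tags, untranslated raw tags)."""
    tags = []
    untranslated = []
    for raw in raw_tags:
        key = raw.strip().lower()
        if key in FR_TO_CORE:
            tags.append(FR_TO_CORE[key])
        else:
            untranslated.append(raw)  # stays in raw_tags rather than polluting tags
    return tags, untranslated


tags, raw_tags = translate_tags(["Modes impersonnels", "Infinitif", "Passé"])
```

Keeping the untranslated strings separate means the tags field only ever contains values from the shared core set.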
The main purpose of the tags field is to enable the use of the data in downstream applications in a cross-linguistic way.
I just fixed the check for the "tags" field to require that it is a list of strings.
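For reference, a check of that shape could look roughly like this. It is a hypothetical stand-in written for this thread, not the actual check_tags() implementation.

```python
from typing import Optional


def tags_error(data: dict) -> Optional[str]:
    """Return an error message if data["tags"] is not a list of strings, else None."""
    if "tags" not in data:
        return None  # the field is optional
    tags = data["tags"]
    if not isinstance(tags, list):
        return f'"tags" must be a list, got {type(tags).__name__}'
    for tag in tags:
        if not isinstance(tag, str):
            return f'"tags" items must be strings, got {type(tag).__name__}'
    return None


ok = tags_error({"tags": ["feminine", "plural"]})
bad_type = tags_error({"tags": "feminine plural"})
bad_item = tags_error({"tags": ["feminine", 3]})
```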
I also fixed the test for "forms"/"form" to not check it if "table-tags" is present in tags (probably only affects the English edition at this time).
Also, xlat_head_map is completely specific to the English wiktionary, and only serves to convert text from Wiktionary into one or more tags or to ignore it. One should draw no conclusions about tag combinations based on it.
Tags are generally intended to be independent of each other, and there are no specific combinations (though clearly some tags relate to specific parts of speech). The exception to this is tags such as "infinitive-1", where I've also tried to add "infinitive" because the form being infinitive may be useful for downstream applications (including machine learning), while some languages have multiple infinitives with more specific meanings. Same for participles and a few other similar cases.
Wait, when did we change tags to be lists of single-tag strings?
Or was I just mixing it up with how they are represented internally??
But what about all of the code with the alternative tagset separation, what happens with all of that data?
Tagsets are represented as lists of lists of tags internally (or-of-ands) and generate multiple entries in the data if there are multiple alternatives.
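A small sketch of that or-of-ands representation, using the hypothetical voyanteX example from earlier in the thread. The function name and the exact shape of the generated entries are assumptions for illustration.

```python
def expand_tagsets(form, tagsets):
    """Turn or-of-ands tagsets into one output entry per alternative.

    `tagsets` is a list of alternatives; each alternative is a list of
    single tags that all apply together.
    """
    return [{"form": form, "tags": list(tagset)} for tagset in tagsets]


# Outer list: alternatives ("or"). Inner lists: tags that belong together ("and").
entries = expand_tagsets(
    "voyanteX",
    [["ergative", "plural"], ["accusative", "dual"]],
)
```

With this shape, each generated entry's "tags" is a flat list of single-tag strings, which matches the list-of-strings format the thread settles on.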
We discussed this with Kristian and I'll close this issue now. No action needed.