wiktextract

[error] derived terms are skipped in hierarchical entries (like Proto-Indo-European)

Open alexchandel opened this issue 2 years ago • 5 comments

Unlike entries for living languages, Proto-Indo-European entries give their derived terms in a hierarchical list rather than a flat one. But wiktextract/kaikki only picks up the "deepest" entry. For example, in the PIE term sek-, the first derived term (reproduced below) should be "sek-eh₂-yé-ti" and the second "sekajō", but these are skipped in favor of "secō". These intermediate derived terms should not be skipped over.

  • *sek-eh₂-yé-ti or *sek-h₁-yé-ti
    • Proto-Italic: *sekajō
      • Latin: secō (see there for further descendants)

alexchandel avatar Aug 08 '22 20:08 alexchandel

@kristian-clausal Where in the code is this handled?

alexchandel avatar Aug 10 '22 05:08 alexchandel

I assume it's not, at least for this kind of list. We're prioritizing living and attested languages, so most peculiarities in proto-language and Reconstruction: category entries are just run through as if they were normal entries in a normal Wiktionary article, which is why they're handled badly.

You probably should not rely on data in Reconstruction: entries generated by wiktextract. This may change in the future, but we still have a lot on our plates with everything else, and creating or interweaving all the code needed specifically to parse Reconstruction: entries is a lot of work.

kristian-clausal avatar Aug 10 '22 06:08 kristian-clausal

@alexchandel In case you're interested in running wiktextract locally to get this information, I have a fork that outputs basic data for Descendants and PIE Derived terms/Extensions sections. It outputs an array of objects, one per line in the list, each carrying data like wiktextract's etymology_templates/etymology_text. Each object also has a depth key that records the nesting level of its line. Since the objects are in the same order as the lines in the wikitext, you can recover the full tree structure by tracking the depth while iterating through the objects. Here's what the beginning of the output for sek- looks like:

"descendants": [
    {
      "depth": 1,
      "tags": [
        "derived"
      ],
      "templates": [
        {
          "args": {
            "1": "ine-pro",
            "2": "",
            "3": "*sek-eh₂-yé-ti"
          },
          "expansion": "*sek-eh₂-yé-ti",
          "name": "l"
        },
        {
          "args": {
            "1": "ine-pro",
            "2": "",
            "3": "*sek-h₁-yé-ti"
          },
          "expansion": "*sek-h₁-yé-ti",
          "name": "l"
        }
      ],
      "text": "*sek-eh₂-yé-ti or *sek-h₁-yé-ti"
    },
    {
      "depth": 2,
      "templates": [
        {
          "args": {
            "1": "itc-pro",
            "2": "*sekajō"
          },
          "expansion": "Proto-Italic: *sekajō",
          "name": "desc"
        }
      ],
      "text": "Proto-Italic: *sekajō"
    },
    {
      "depth": 3,
      "templates": [
        {
          "args": {
            "1": "la",
            "2": "secō"
          },
          "expansion": "Latin: secō",
          "name": "desc"
        },
        {
          "args": {},
          "expansion": "(see there for further descendants)",
          "name": "see desc"
        }
      ],
      "text": "Latin: secō (see there for further descendants)"
    },
    {
      "depth": 1,
      "tags": [
        "derived"
      ],
      "templates": [
        {
          "args": {
            "1": "ine-pro",
            "2": "*skey-",
            "3": "*sk-éy-ti",
            "pos": "*éy-present"
          },
          "expansion": "*sk-éy-ti (*éy-present)",
          "name": "l"
        }
      ],
      "text": "*sk-éy-ti (*éy-present)"
    },
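
The depth-tracking idea described above can be sketched roughly as follows. This is only an illustration, not code from the fork: the `build_tree` function and the `children` key are hypothetical names, and the sample list is trimmed to the `depth` and `text` keys from the output shown above.

```python
from typing import Any


def build_tree(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Nest a flat, depth-tagged list of lines into a tree.

    Hypothetical helper: assumes each entry has "depth" and "text" keys
    and that entries appear in the same order as the wikitext lines.
    """
    roots: list[dict[str, Any]] = []
    stack: list[dict[str, Any]] = []  # chain of currently open ancestors
    for entry in entries:
        node = {"depth": entry["depth"], "text": entry["text"], "children": []}
        # Close every open node that is not shallower than this line.
        while stack and stack[-1]["depth"] >= node["depth"]:
            stack.pop()
        if stack:
            stack[-1]["children"].append(node)  # nearest shallower line is the parent
        else:
            roots.append(node)  # no open ancestor: top-level line
        stack.append(node)
    return roots


# Sample input reduced to the keys the sketch needs.
descendants = [
    {"depth": 1, "text": "*sek-eh₂-yé-ti or *sek-h₁-yé-ti"},
    {"depth": 2, "text": "Proto-Italic: *sekajō"},
    {"depth": 3, "text": "Latin: secō (see there for further descendants)"},
    {"depth": 1, "text": "*sk-éy-ti (*éy-present)"},
]
tree = build_tree(descendants)
```

With this input, `tree` holds two top-level nodes; the Proto-Italic line ends up as a child of the first, and the Latin line as a child of the Proto-Italic one, so the intermediate terms are kept rather than skipped.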

jmviz avatar Aug 15 '22 19:08 jmviz

Would be nice to merge this fork. "Descendants," "Extensions," "Derived terms" are all standard sections according to Wiktionary's entry layout guidelines.

alexchandel avatar Sep 19 '22 02:09 alexchandel