
Explore using Wiktionary dumps to derive translation data

Open andrewtavis opened this issue 1 month ago • 2 comments

Description

The Scribe community needs translation data for its projects. One way to get it would be from Wiktionary. The benefits are that the data is expansive and that it covers the various versions (senses) of a word. This issue would entail looking into the following:

  • First exploring the current API process: https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wiktionary/parse_mediaWiki.py
  • The output for this is appropriate for what we need and should be modeled in the new process
    • We want a dictionary whose top-level keys are words from Wiktionary, with the next level keyed by a data type like noun or verb
    • Within these sub-dictionaries, each numbered sense would have the ISO-2 code of the translation language as a key and the translation as the value
    • We'd also want the description of the word from Wiktionary
      • Example: https://en.wiktionary.org/wiki/book/translations#Noun
      • We'd want "collection of sheets of paper bound together containing printed or written material"
{
  "book": {
    "noun": {
      "1": {
        "_description": "collection of sheets of paper bound together containing printed or written material",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      "2": {
        "_description": "another_description",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    },
    "verb": {
      "1": {
        "_description": "to reserve",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    }
  }
}
  • The new process would be based on the Wiktionary dumps (only EN)
  • We'd want a function that when called would get all translations for all words from a Wiktionary dump
    • Inputs would be an ISO-2 code for the language and an optional dump ID (in case we need to run it on a specific dump; defaults to the latest)
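
As a starting point for discussion, here is a minimal sketch of two helpers such a function could be built on. The names `dump_url` and `add_translation` are hypothetical and do not exist in Scribe-Data. The URL layout is an assumption based on how the public dumps at dumps.wikimedia.org are organized, and should be checked against the actual dump listing before use.

```python
from typing import Optional

DUMPS_BASE = "https://dumps.wikimedia.org"


def dump_url(iso: str, dump_id: Optional[str] = None) -> str:
    """Build the URL of a Wiktionary pages-articles dump.

    Assumption: dumps follow the <iso>wiktionary/<dump_id>/ layout and
    "latest" is a valid dump ID. Hypothetical helper, not part of Scribe-Data.
    """
    dump_id = dump_id or "latest"
    wiki = f"{iso.lower()}wiktionary"
    return f"{DUMPS_BASE}/{wiki}/{dump_id}/{wiki}-{dump_id}-pages-articles.xml.bz2"


def add_translation(data: dict, word: str, pos: str,
                    description: str, iso: str, translation: str) -> dict:
    """Insert one translation into the nested word -> pos -> sense structure."""
    senses = data.setdefault(word, {}).setdefault(pos, {})
    # Reuse the sense that already carries this description, if there is one.
    for sense in senses.values():
        if sense["_description"] == description:
            break
    else:
        # Otherwise open a new sense under the next number ("1", "2", ...).
        sense = senses.setdefault(str(len(senses) + 1),
                                  {"_description": description})
    sense[iso] = translation
    return data
```

Calling `add_translation` repeatedly for the same word and description fills a single numbered sense, which matches the shape of the example above.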

Note: We could consider using wiktextract for this

Contribution

Happy to explore how to proceed here and also help with coding/review :)

andrewtavis, Nov 01 '25 15:11

CC @catreedle and @axif0 👋

andrewtavis, Nov 01 '25 15:11

@catreedle: Steps for this would be:

  • Download one Wiktionary dump
  • Get all translations for a single word from the dump
  • Get all translations for all words from the dump
  • Start with a smaller Wiktionary (https://dumps.wikimedia.org/idwiktionary/)
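
The steps above could be sketched roughly as follows. This is a simplified illustration, not a proposed implementation: it assumes English Wiktionary style markup (`{{trans-top|...}}`, `{{t|...}}`, `{{t+|...}}`, `{{trans-bottom}}`), which other language editions such as idwiktionary may not share, and it only recognises a handful of part-of-speech headings. All function names are hypothetical.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Simplified patterns; real pages contain more template variants than these.
POS_HEADING = re.compile(r"^===+\s*(Noun|Verb|Adjective|Adverb)\s*===+\s*$")
TRANS_TOP = re.compile(r"\{\{trans-top\|([^}]*)\}\}")
TRANSLATION = re.compile(r"\{\{tt?\+?\|([a-z-]+)\|([^}|]+)")


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # release the page's children once it is handled


def parse_translations(wikitext):
    """Return {pos: {sense_number: {"_description": ..., iso: translation}}}."""
    result, pos, sense = {}, None, None
    for line in wikitext.splitlines():
        heading = POS_HEADING.match(line.strip())
        if heading:
            pos = heading.group(1).lower()
            continue
        top = TRANS_TOP.search(line)
        if top and pos:
            # The gloss is the first positional argument (skip id=... etc.).
            args = top.group(1).split("|")
            gloss = next((a.strip() for a in args if "=" not in a), "")
            senses = result.setdefault(pos, {})
            sense = senses.setdefault(str(len(senses) + 1),
                                      {"_description": gloss})
            continue
        if "{{trans-bottom" in line:
            sense = None
            continue
        if sense is not None:
            for iso, word in TRANSLATION.findall(line):
                sense.setdefault(iso, word)  # keep the first translation per code
    return result


def collect_translations(source):
    """Get translations for all words in an already opened dump stream."""
    data = {}
    for title, text in iter_pages(source):
        parsed = parse_translations(text)
        if parsed:
            data[title] = parsed
    return data


def translations_from_dump(path):
    """Open a downloaded .xml.bz2 dump and collect translations from it."""
    with bz2.open(path, "rb") as dump:
        return collect_translations(dump)
```

`parse_translations` covers the single-word step and `collect_translations` the all-words step. Streaming with `iterparse` avoids loading the whole dump into memory, which matters less for a smaller Wiktionary but would for the English one.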

andrewtavis, Nov 01 '25 16:11