Scribe-Data Explore using Wiktionary dumps to derive translation data

Terms

[x] I have searched open and closed feature requests
[x] I agree to follow Scribe-Data's Code of Conduct

Description

The Scribe community needs translation data for its projects. One way to get it would be to derive it from Wiktionary. The benefits are that the data is expansive and that it includes translations for the different senses of a word. This issue would entail looking into the following:

First, explore the current API-based process: https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wiktionary/parse_mediaWiki.py
Its output format is appropriate for what we need and should serve as the model for the new process
- We want a dictionary where the top-level keys are words from Wiktionary and the first key under each word is a word type like noun or verb
- Within these sub-dictionaries, each numbered sense would have the ISO-2 code of the translation language as a key and the translation as the value
- We'd also want the description of each sense from Wiktionary
  - Example: https://en.wiktionary.org/wiki/book/translations#Noun
  - We'd want "collection of sheets of paper bound together containing printed or written material"

{
  "book": {
    "noun": {
      "1": {
        "_description": "collection of sheets of paper bound together containing printed or written material",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      "2": {
        "_description": "another_description",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    },
    "verb": {
      "1": {
        "_description": "to reserve",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    }
  }
}
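
As a rough illustration of the structure above, here is a minimal Python sketch that builds it up one sense at a time. The helper name `add_sense` and the German and French sample values are assumptions made for this example, not part of Scribe-Data.

```python
def add_sense(data, word, word_type, description, translations):
    """Add one numbered sense for `word` under `word_type` (hypothetical helper)."""
    senses = data.setdefault(word, {}).setdefault(word_type, {})
    sense_id = str(len(senses) + 1)  # senses are keyed "1", "2", ...
    senses[sense_id] = {"_description": description, **translations}
    return sense_id


data = {}
add_sense(
    data,
    "book",
    "noun",
    "collection of sheets of paper bound together containing printed or written material",
    {"de": "Buch", "fr": "livre"},  # illustrative values only
)
add_sense(data, "book", "verb", "to reserve", {"de": "buchen"})
```

Keeping `_description` alongside the ISO-2 keys means a consumer has to skip keys starting with an underscore when iterating over translations.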

The new process would be based on the Wiktionary dumps (English Wiktionary only)
We'd want a function that, when called, gets all translations for all words from a Wiktionary dump
- Inputs would be an ISO-2 code for the language and an optional dump ID (in case we need to run it on a specific dump; defaults to the latest)
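
A minimal sketch of how these inputs might be resolved, assuming the usual `dumps.wikimedia.org` layout in which `latest` or a `YYYYMMDD` date names the dump directory. The function and constant names are hypothetical, and downloading and parsing are left out.

```python
import re
from typing import Optional

# Assumed URL layout of the English Wiktionary dumps on dumps.wikimedia.org.
DUMP_URL_TEMPLATE = (
    "https://dumps.wikimedia.org/enwiktionary/{dump_id}/"
    "enwiktionary-{dump_id}-pages-articles.xml.bz2"
)


def resolve_dump_request(iso_2: str, dump_id: Optional[str] = None) -> dict:
    """Validate the inputs and return the language code and dump URL."""
    if not re.fullmatch(r"[a-z]{2}", iso_2):
        raise ValueError(f"Expected a two-letter ISO code, got {iso_2!r}")
    dump_id = dump_id or "latest"  # no ID given: fall back to the latest dump
    if dump_id != "latest" and not re.fullmatch(r"\d{8}", dump_id):
        raise ValueError(f"Expected 'latest' or a YYYYMMDD dump ID, got {dump_id!r}")
    return {"iso_2": iso_2, "url": DUMP_URL_TEMPLATE.format(dump_id=dump_id)}
```

Calling `resolve_dump_request("de")` would point at the latest dump, while `resolve_dump_request("de", "20240101")` would pin the run to one dated dump.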

Note: We could consider using wiktextract for this

Contribution

Happy to explore how to proceed here and also help with coding/review :)

Nov 01 '25 15:11 andrewtavis

CC @catreedle and @axif0 👋

Nov 01 '25 15:11 andrewtavis

@catreedle: Steps for this would be:

- Download one Wiktionary dump
- Get all translations for a single word from the dump
- Get all translations for all words from the dump
- Start with a smaller Wiktionary (https://dumps.wikimedia.org/idwiktionary/)
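
The single-word and all-words steps could be sketched roughly as below. This assumes the English Wiktionary markup, where translations sit between `{{trans-top|gloss}}` and `{{trans-bottom}}` and use `{{t|...}}`-style templates; other editions such as the Indonesian one may use different templates. All function names are hypothetical, and grouping by word type (noun, verb) is left out to keep the example short.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# One translation table: gloss is the first parameter of trans-top.
TRANS_BLOCK = re.compile(
    r"\{\{trans-top\|(?P<gloss>[^|}]*)[^}]*\}\}(?P<body>.*?)\{\{trans-bottom\}\}",
    re.DOTALL,
)
# Assumed template family: {{t|..}}, {{t+|..}}, {{tt|..}}, {{tt+|..}}.
TRANS_ITEM = re.compile(r"\{\{tt?\+?\|(?P<lang>[^|}]+)\|(?P<term>[^|}]+)")


def translations_for_page(wikitext):
    """Return numbered senses with a description and one translation per language."""
    senses = {}
    for number, block in enumerate(TRANS_BLOCK.finditer(wikitext), start=1):
        sense = {"_description": block.group("gloss").strip()}
        for item in TRANS_ITEM.finditer(block.group("body")):
            # Keep only the first translation listed for each language code.
            sense.setdefault(item.group("lang").strip(), item.group("term").strip())
        senses[str(number)] = sense
    return senses


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream):  # default: "end" events only
        name = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # release the page's children once processed


def translations_from_dump(path):
    """Collect translations for every page in a .xml.bz2 dump file."""
    result = {}
    with bz2.open(path, "rb") as stream:
        for title, text in iter_pages(stream):
            senses = translations_for_page(text)
            if senses:
                result[title] = senses
    return result
```

Streaming with `iterparse` avoids loading the whole dump into memory, which matters for the English Wiktionary; a smaller dump is still a more comfortable starting point for testing.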

Nov 01 '25 16:11 andrewtavis
