Generating Just a List of Latin Words

Open Aurorum opened this issue 2 years ago • 1 comments

Hi there,

Thank you very much for the programme, but I'm afraid that I've found a little difficulty navigating it - partly because my device doesn't seem capable of handling large quantities of words. Is there any way to just generate a list of words within a particular language, without having any of the additional data?

For example, I'm hoping to create some form of Latin Scrabble, so having just a list of Latin words in a TXT/JSON/CSV file is what I'm trying to achieve. Please forgive me if I've missed something as I've read over the documentation repeatedly, but I'm still struggling to understand how to generate one, so would be grateful for any assistance!

Thank you very much! :)

Aurorum avatar Aug 07 '22 12:08 Aurorum

The easiest way would be to download the compressed raw data (here) and process that directly. You should only need about 1.5 GB of free space on your machine. You can then read the file line by line, which shouldn't be overly taxing on memory, though it might take a while:

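import gzip
import json

# Stream the compressed dump one JSON object per line, so it never has to fit in memory.
with gzip.open("raw-wiktextract-data.json.gz", "rt", encoding="utf-8") as in_file, \
        open("latin.txt", "w", encoding="utf-8") as out_file:
    for line in in_file:
        entry = json.loads(line)
        # Keep only entries tagged as Latin ("la").
        if entry.get("lang_code") == "la":
            out_file.write(entry["word"] + "\n")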

This got me 860,340 Latin "words". That includes a bunch of inflected/conjugated forms, some affixes, etc. that you may or may not want in something like a Scrabble list, so you might want to add further logic that inspects other parts of the JSON for each entry to decide whether to use the given "word".
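As a rough sketch of what that "further logic" could look like: the filter below drops affixes, phrases, abbreviations, and proper names, skips anything that isn't a plain lowercase alphabetic string, and optionally drops entries whose senses are all marked as inflected forms. The helper names (`is_candidate`, `collect_words`) and the `EXCLUDED_POS` set are made up for illustration, and the field names it inspects (`pos`, `senses`, `form_of`, `tags`) are my assumption about the shape of the entries, so check a few lines of your own dump before relying on it.

```python
import json

# Assumed part-of-speech labels to exclude; adjust after inspecting your data.
EXCLUDED_POS = {"prefix", "suffix", "infix", "interfix", "affix",
                "phrase", "proverb", "abbrev", "name"}


def is_candidate(entry, keep_inflections=False):
    """Return True if a parsed entry looks like a usable Latin word."""
    if entry.get("lang_code") != "la":
        return False
    word = entry.get("word", "")
    # Rejects hyphenated affixes, multi-word phrases, and capitalised names.
    if not (word.isalpha() and word.islower()):
        return False
    if entry.get("pos") in EXCLUDED_POS:
        return False
    if not keep_inflections:
        senses = entry.get("senses", [])
        # Assumption: an inflected form has "form_of" data or a "form-of" tag
        # on its senses. Drop the entry only if every sense is like that.
        if senses and all(
            "form_of" in s or "form-of" in s.get("tags", []) for s in senses
        ):
            return False
    return True


def collect_words(lines, keep_inflections=False):
    """Parse an iterable of JSON lines and return a sorted, de-duplicated list."""
    words = set()
    for line in lines:
        entry = json.loads(line)
        if is_candidate(entry, keep_inflections):
            words.add(entry["word"])
    return sorted(words)
```

To use it with the script above, pass the open gzip file to `collect_words(in_file)` and write the returned list out one word per line. For Scrabble you may actually want inflected forms, in which case call it with `keep_inflections=True`.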

jmviz avatar Aug 07 '22 14:08 jmviz

@jmviz Really sorry, but I've been struggling to run this script for some reason - not very familiar with Python or the project! If it wouldn't be too much to ask, would you mind sending me that list of 860K words? [email protected]

Would be really grateful, though totally understand if not possible - thank you very much!

Aurorum avatar Aug 14 '22 09:08 Aurorum

Sure, I sent it. Like I said, there will be various stuff in there (affixes, phrases, abbreviations, inflections, etc.) that you may or may not want.

jmviz avatar Aug 14 '22 10:08 jmviz

Thank you so much, really appreciated!! :)

Aurorum avatar Aug 14 '22 10:08 Aurorum