wiktextract
Generating Just a List of Latin Words
Hi there,
Thank you very much for the programme, but I'm afraid that I've found a little difficulty navigating it - partly because my device doesn't seem capable of handling large quantities of words. Is there any way to just generate a list of words within a particular language, without having any of the additional data?
For example, I'm hoping to create some form of Latin Scrabble, so having just a list of Latin words in a TXT/JSON/CSV file is what I'm trying to achieve. Please forgive me if I've missed something as I've read over the documentation repeatedly, but I'm still struggling to understand how to generate one, so would be grateful for any assistance!
Thank you very much! :)
The easiest way would be to download the compressed raw data (here) and process that directly. You should only need about 1.5 GB of free space on your machine. Then you can read the file line by line, which shouldn't be overly taxing on your machine. It might take a while though:
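import gzip
import json

# Stream the gzipped dump one JSON entry per line to keep memory use low.
with gzip.open("raw-wiktextract-data.json.gz", "rt", encoding="utf-8") as in_file, \
        open("latin.txt", "w", encoding="utf-8") as out_file:
    for line in in_file:
        entry = json.loads(line)
        # Keep only Latin entries ('la' is the Latin language code).
        if entry.get('lang_code') == 'la':
            out_file.write(entry['word'] + '\n')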
This got me 860,340 Latin "words". The list will include a bunch of inflected/conjugated forms, some affixes, etc. that you may or may not want in something like a Scrabble list. So you might want to add further logic that inspects other parts of the JSON for each entry to decide whether to use the given "word".
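As a rough idea of what that further logic could look like, here is a minimal sketch. It assumes each entry carries a `pos` field and a `senses` list whose items may have a `form_of` key or a `form-of` tag; those field names and the part-of-speech labels in `EXCLUDED_POS` are my assumptions about the data, so check a few entries in your copy before relying on them. The helper names `is_scrabble_candidate` and `write_word_list` are made up for this example.

```python
import gzip
import json

# Assumed part-of-speech labels to skip; adjust after inspecting your data.
EXCLUDED_POS = {"prefix", "suffix", "infix", "interfix", "phrase",
                "proverb", "abbrev", "name", "character", "symbol"}


def is_scrabble_candidate(entry, keep_inflected=False):
    """Return True if the entry looks like a usable single Latin word."""
    if entry.get("lang_code") != "la":
        return False
    word = entry.get("word", "")
    # Keep only single alphabetic tokens (no spaces, hyphens, or digits).
    if not word.isalpha():
        return False
    if entry.get("pos") in EXCLUDED_POS:
        return False
    if not keep_inflected:
        senses = entry.get("senses", [])
        # Assumption: inflected forms mark every sense as a form of a lemma.
        if senses and all(
            "form_of" in sense or "form-of" in sense.get("tags", [])
            for sense in senses
        ):
            return False
    return True


def write_word_list(in_path, out_path, keep_inflected=False):
    """Stream the gzipped dump and write unique, sorted, lowercased words."""
    words = set()
    with gzip.open(in_path, "rt", encoding="utf-8") as in_file:
        for line in in_file:
            entry = json.loads(line)
            if is_scrabble_candidate(entry, keep_inflected):
                words.add(entry["word"].lower())
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n".join(sorted(words)) + "\n")
    return len(words)
```

Calling `write_word_list("raw-wiktextract-data.json.gz", "latin.txt")` would then write the filtered list. Pass `keep_inflected=True` if your Scrabble rules allow declined and conjugated forms; the set also removes duplicates, since the same word can appear once per part of speech.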
@jmviz Really sorry, but I've been struggling to run this script for some reason - not very familiar with Python or the project! If it wouldn't be too much to ask, would you mind sending me that list of 860K words? [email protected]
Would be really grateful, though totally understand if not possible - thank you very much!
Sure, I sent it. Like I said, there will be various stuff in there (affixes, phrases, abbreviations, inflections, etc.) that you may or may not want.
Thank you so much, really appreciated!! :)