securedrop
i18n: european portuguese word list
Imported from Dicionários Natura at https://natura.di.uminho.pt/download/sources/Dictionaries/wordlists/. The list still needs to be trimmed of words that would be difficult to memorize.
Hi @lisbonjoker!
I'd like to understand better how to review your PR. Could you please give some more context on that word list and what made you choose it?
For example, I'd love to hear more about:
- How is the word list licensed?
- How is it composed? By whom?
- What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?
- What purpose would it fulfill in the context of SecureDrop?
How is the word list licensed?
The dictionaries are covered by the GPL, LGPL, and MPL licenses (or at least one of them).
How is it composed? By whom?
The Natura Project is a small research group in Natural Language Processing at the Department of Computer Science, University of Minho. It is part of a larger Language Processing and Specification group.
More in: https://natura.di.uminho.pt/wiki/doku.php?id=dicionarios:main
Current management: José João Almeida, Alberto Simões
Other collaborators: Rui Vilela, António Dias, Paulo Rocha, Ulisses Pinto
What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?
It is a list of Portuguese words that also contains proper names, acronyms, abbreviations, and common loanwords. The list is derived from the Jspell dictionary for morphological analysis.
What purpose would it fulfill in the context of SecureDrop?
It would let European Portuguese speakers use a SecureDrop instance in their own variant of the language. There is a big difference between PT BR and PT EU, and some words from one are unused or unrecognized in the other.
Don't know if this is helpful to this conversation, but in an effort to cut this very long list down to a length closer to the existing SecureDrop wordlists, I took the most frequently appearing words from Portuguese Wikipedia articles (with help from this project), then filtered out every word NOT on this 994,951-word list.
I then removed all words with accented characters or non-UTF-8 characters (I think), all words not between 3 and 15 characters long, and any Roman numerals. (Notably, I didn't filter out profane words.) I arbitrarily chose to make this new list 10,000 words long. The result was this wordlist. Hope this helps -- sorry if it derails things.
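For anyone who wants to reproduce or adjust the filtering described above, here is a minimal Python sketch of those steps. It is not the script that was actually used. The function names (`is_roman_numeral`, `keep_word`, `build_wordlist`) are hypothetical, and the input is assumed to be an iterable of words already sorted from most to least frequent, plus a set holding the full dictionary list. "No accented characters" is read here as "ASCII letters only", and the Roman numeral check only catches well-formed numerals, so it also drops real words such as "mix" that happen to be valid numerals.

```python
import re

# Well-formed Roman numerals in lowercase (assumption: standard subtractive form).
ROMAN_RE = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")


def is_roman_numeral(word):
    """Return True if the (lowercase) word is a well-formed Roman numeral."""
    return bool(word) and ROMAN_RE.fullmatch(word) is not None


def keep_word(word, allowed):
    """Apply the filters from the comment above to a single lowercase word."""
    return (
        word in allowed              # must appear in the full dictionary list
        and 3 <= len(word) <= 15     # length limits from the comment
        and word.isascii()           # assumption: drops accented characters
        and word.isalpha()           # letters only, no digits or punctuation
        and not is_roman_numeral(word)
    )


def build_wordlist(ranked_words, allowed, size=10_000):
    """Keep the first `size` distinct words, in frequency order, that pass the filters."""
    result = []
    seen = set()
    for raw in ranked_words:
        word = raw.strip().lower()
        if word in seen or not keep_word(word, allowed):
            continue
        seen.add(word)
        result.append(word)
        if len(result) == size:
            break
    return result


if __name__ == "__main__":
    # Tiny made-up inputs, only to show the shape of the data.
    allowed = {"casa", "cão", "xiv", "de", "livro", "tempo"}
    ranked = ["de", "casa", "Casa", "cão", "xiv", "livro", "wiki", "tempo"]
    print(build_wordlist(ranked, allowed, size=10))  # ['casa', 'livro', 'tempo']
```

A profanity filter or a "hard to memorize" filter could be added as one more condition in `keep_word`, which is the step the PR description says is still missing.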