i18n: european portuguese word list

Open lisbonjoker opened this issue 2 years ago • 3 comments

Imported from Dicionários Natura at https://natura.di.uminho.pt/download/sources/Dictionaries/wordlists/ . The list still needs trimming to remove words that would be difficult to memorize.

lisbonjoker avatar Jul 19 '21 12:07 lisbonjoker

Hi @lisbonjoker!

I'd like to understand better how to review your PR. Could you please give some more context on that word list and what made you choose it?

For example, I'd love to hear more about:

  • How is the word list licensed?
  • How is it composed? By whom?
  • What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?
  • What purpose would it fulfill in the context of SecureDrop?

gonzalo-bulnes avatar Nov 16 '21 05:11 gonzalo-bulnes

How is the word list licensed?

The dictionaries are covered by the GPL, LGPL, and MPL licenses (or at least one of them).

How is it composed? By whom?

The Natura Project is a small research group in Natural Language Processing at the Department of Computer Science, University of Minho. It is part of a larger Language Processing and Specification group.

More in: https://natura.di.uminho.pt/wiki/doku.php?id=dicionarios:main

Current Management

José João Almeida, Alberto Simões

Other collaborators

Rui Vilela, António Dias, Paulo Rocha, Ulisses Pinto

What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?

List of Portuguese words (including some acronyms, etc).

It contains proper names, acronyms, abbreviations and common loanwords. This list is derived from the Jspell dictionary for morphological analysis.

What purpose would it fulfill in the context of SecureDrop?

It would let European Portuguese speakers use SecureDrop in their own variant of the language, as there is a big difference between PT BR and PT EU. Some words used in one variant are unused or unrecognized in the other.

lisbonjoker avatar Jan 07 '24 03:01 lisbonjoker

Don't know if this is helpful to this conversation, but:

In an effort to cut this very long list down to a length closer to the existing SecureDrop wordlists, I took the most frequently appearing words from Portuguese Wikipedia articles (with help from this project), then filtered out every word NOT on this 994,951-word list.

I then removed any and all words with accented characters or non-UTF-8 characters (I think), all words not between 3 and 15 characters, and any Roman numerals. (Notably I didn't filter out profane words.) I arbitrarily chose to make this new list 10,000 words. The result was this wordlist. Hope this helps -- sorry if it derails things.
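
For anyone who wants to reproduce or adjust the filtering described above, here is a minimal Python sketch of those steps. It is not the script that was actually used: the function names (`is_roman_numeral`, `keep`, `select_words`) are hypothetical, "no accented or non-UTF-8 characters" is approximated as "ASCII only", and the inputs are assumed to be already lowercased, with the Wikipedia words ordered from most to least frequent.

```python
import re

# Strict Roman-numeral pattern (1..3999). The lookahead rejects the empty string.
ROMAN_RE = re.compile(
    r"(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)


def is_roman_numeral(word):
    """Return True if the whole word is a well-formed Roman numeral."""
    return ROMAN_RE.fullmatch(word.lower()) is not None


def keep(word, allowed, min_len=3, max_len=15):
    """Apply the filters from the comment above to a single word."""
    return (
        word in allowed                       # must be on the large source list
        and min_len <= len(word) <= max_len   # between 3 and 15 characters
        and word.isascii()                    # assumption: ASCII only, so no accents
        and not is_roman_numeral(word)        # drop things like "xxi"
    )


def select_words(ranked_words, allowed_words, limit=10_000):
    """Walk the frequency-ranked words and keep the first `limit` that pass."""
    allowed = set(allowed_words)
    seen = set()
    selected = []
    for word in ranked_words:
        if word in seen:
            continue  # skip duplicates in the frequency list
        seen.add(word)
        if keep(word, allowed):
            selected.append(word)
            if len(selected) == limit:
                break
    return selected


if __name__ == "__main__":
    # Tiny made-up sample, only to show the call shape.
    ranked = ["de", "casa", "não", "xxi", "casa", "tempo", "wikipedia", "livro"]
    allowed = {"de", "casa", "não", "xxi", "tempo", "livro"}
    print(select_words(ranked, allowed, limit=10))
```

Note that the strict Roman-numeral check also drops ordinary words that happen to be valid numerals (for example "mix" parses as 1009), so a real run might want a short exception list. A profanity filter, which was not applied here, could be added as one more condition in `keep`.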

sts10 avatar Jan 09 '24 03:01 sts10