
How are the single-letter, bigram, trigram and quadgram scores generated?

Open Rlvx opened this issue 3 years ago • 1 comments

Hello,

I am trying to use your code on a French ciphertext, so I need to change the values in "/resources/data" to make it work. Unfortunately, I don't understand how you generated the scores for single letters, bigrams, trigrams and quadgrams. I suppose you used an English sample text to generate those values, and if so, it would be great if I could have the source code for this particular part.

Thanks in advance!

Rlvx avatar Apr 09 '22 10:04 Rlvx

Hi! The data I used for this was apparently a count of the bigrams, trigrams etc. from the Google Books archive. Now that I look, though, I can't remember where I got it from! I think it originally came from the Google Ngram Viewer; getting data out of this is possible, but not trivial. Note that some of these websites use "n-grams" to mean words, and some to mean characters.

E.g. here: https://stressosaurus.github.io/raw-data-google-ngram/

I downloaded some CSV files with raw likelihoods of n-grams, from 1-grams to 7-grams I think. I then took only the 1-4 sets and converted these into log-likelihood values (which are negative) to make computation easier. For each probability p I calculated log10(p).
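As a rough sketch of that conversion step (not the code originally used, and the function name and sample counts here are made up for illustration), assuming you start from raw counts rather than probabilities:

```python
import math

def log10_probabilities(counts):
    """Turn raw n-gram counts into log10(p), where p = count / total.

    `counts` maps an n-gram string to its count. N-grams with a zero
    count are skipped, because log10(0) is undefined.
    """
    total = sum(counts.values())
    return {
        gram: math.log10(count / total)
        for gram, count in counts.items()
        if count > 0
    }

# Made-up counts, purely for illustration.
example_counts = {"TH": 50, "HE": 30, "IN": 20}
scores = log10_probabilities(example_counts)

for gram, score in sorted(scores.items()):
    print(f"{gram},{score:.6f}")
```

Working in logs means the score of a candidate plaintext can be a sum over its n-grams rather than a product of tiny probabilities. The exact file layout expected in "/resources/data" should be checked against the existing files; the comma-separated output above is only a guess.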

This resource also might help:

https://github.com/orgtre/frenchngrams

These are word n-grams, but you could extract character n-gram values from the 6-grams pretty well, I'd have thought. The number of words here is so high that the frequencies wouldn't change too much, I'd guess.
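Something along these lines might work for pulling character n-grams out of word n-grams. It is an untested-on-real-data sketch: the input is assumed to be a mapping from a phrase to its count (the actual file format in that repository would need its own parsing), and the helper names are hypothetical. Accents are folded to plain A-Z and spaces are dropped, so character n-grams run across word boundaries.

```python
import unicodedata
from collections import Counter

def to_letters(text):
    """Reduce text to plain A-Z: expand ligatures, strip accents, drop the rest."""
    for ligature, plain in (("œ", "oe"), ("Œ", "OE"), ("æ", "ae"), ("Æ", "AE")):
        text = text.replace(ligature, plain)
    # NFD splits "é" into "e" plus a combining accent, which the filter removes.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed.upper() if "A" <= c <= "Z")

def char_ngram_counts(word_ngrams, n):
    """Count character n-grams, weighting each phrase by its own count."""
    counts = Counter()
    for phrase, freq in word_ngrams.items():
        letters = to_letters(phrase)
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += freq
    return counts

# Made-up phrases and counts, purely for illustration.
sample = {"le chat est là": 3, "été à Paris": 2}
bigrams = char_ngram_counts(sample, 2)
print(bigrams.most_common(3))
```

One caveat: overlapping word n-grams from the same corpus share words, so letters in the middle of a sentence get counted more often than letters at the edges. With a large corpus the resulting relative frequencies are probably still close enough.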

mikepound avatar Apr 09 '22 21:04 mikepound