enigma
How are the single-letter/bigram/trigram/quadgram scores generated?
Hello,
I am trying to use your code on a French ciphertext, so I need to change the values in "/resources/data" to make it work. Unfortunately, I don't understand how you generated the scores for the single letters, bigrams, trigrams and quadgrams. I suppose you used an English sample text to generate those values, and if so, it would be great if I could have the source code for that particular part.
Thanks in advance!
Hi! The data I used for this was apparently a count of the bigrams, trigrams etc. from the Google Books archive. Now that I look, though, I can't remember where I got it from! I think it originally came from the Google Ngram Viewer; getting data out of this is possible, but not trivial. Note that some of these websites use n-grams to mean words, others characters.
E.g. here: https://stressosaurus.github.io/raw-data-google-ngram/
I downloaded some CSV files with raw likelihoods of n-grams, from 1-grams to 7-grams I think. I then took only the 1-4 sets and converted these into log probabilities (negative values, since each p is below 1) to make computation easier: for each probability p I calculated log10(p).
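If you would rather build the values from your own French sample text than from the Google data, here is a minimal Python sketch of that log10(p) conversion. It is not the code I originally used. The function names are made up for illustration, and it assumes accents should be folded to plain A-Z and that n-grams may run across word boundaries:

```python
import math
import unicodedata
from collections import Counter


def normalise(text):
    """Fold accents (e.g. É -> E) and keep only the letters A-Z."""
    # NFD splits an accented letter into base letter + combining mark;
    # the filter then drops the mark, spaces, digits and punctuation.
    # Ligatures such as Œ do not decompose and are simply dropped here.
    decomposed = unicodedata.normalize("NFD", text.upper())
    return "".join(c for c in decomposed if "A" <= c <= "Z")


def ngram_counts(letters, n):
    """Count overlapping character n-grams in a string of letters."""
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def log10_scores(counts):
    """Convert raw counts to log10(p), where p = count / total count."""
    total = sum(counts.values())
    return {gram: math.log10(count / total) for gram, count in counts.items()}


# Usage: score the bigrams of a (very small) sample text.
sample = normalise("Le chat est sur la table")
bigram_scores = log10_scores(ngram_counts(sample, 2))
```

Any n-gram that never occurs in the sample gets no score at all, so whatever reads these values needs some fallback for missing entries. You would also need to write the result out in the same layout as the existing files in "/resources/data", which this sketch does not attempt.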
This resource also might help:
https://github.com/orgtre/frenchngrams
These are word n-grams, but I'd have thought you could extract character n-gram values from the 6-grams pretty well. The number of words here is so high that the frequencies wouldn't change too much, I'd guess.
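As a rough sketch of that extraction in Python: it assumes you have already parsed the word n-gram data into (phrase, count) pairs, since I haven't checked the actual file layout in that repository, and the function name is just for illustration. Each character n-gram is weighted by the count of the phrase it appears in:

```python
import unicodedata
from collections import Counter


def char_counts_from_word_ngrams(rows, n):
    """Sum weighted character n-gram counts over (phrase, count) pairs."""
    totals = Counter()
    for phrase, count in rows:
        # Fold accents and drop everything except A-Z, including spaces,
        # so character n-grams are allowed to span word boundaries.
        decomposed = unicodedata.normalize("NFD", phrase.upper())
        letters = "".join(c for c in decomposed if "A" <= c <= "Z")
        for i in range(len(letters) - n + 1):
            totals[letters[i:i + n]] += count
    return totals


# Usage with two made-up rows (hypothetical data, not from the repository).
rows = [("le chat", 3), ("la", 2)]
bigram_totals = char_counts_from_word_ngrams(rows, 2)
```

Because neighbouring word n-grams overlap, a given character n-gram is counted once for every phrase it appears in, so this is an approximation rather than an exact corpus count. The totals can then be turned into log10(p) values in the usual way.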