
codebook with the most frequent ngrams in language/s

Open wis opened this issue 7 years ago • 0 comments

I know this guy.. ;) (from Redis). Did you hand-pick the codebook dictionary? How? Have you thought about using the most frequent n-grams in the target language(s), e.g. the top 32 or so from each of Norvig's ngrams2,3,4,5,6,7,8,9.csv? How do you pick them optimally for minimum overlap and better compression rates? For example, tion and ation are the most common 4- and 5-letter n-grams respectively, and tio is the 6th most common 3-letter n-gram. I think you'd get much higher compression rates.
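To make the "minimum overlap" question concrete, here is a rough sketch of one possible greedy heuristic. It is not how the existing codebook was built, as far as I know. It assumes each codebook hit replaces the n-gram with a single byte, so it scores a candidate as count * (length - 1) bytes saved. After picking an n-gram, it subtracts that n-gram's count from any shorter candidate contained in it, since those occurrences are already covered. The names `struct cand`, `savings` and `pick_greedy` are made up for illustration, and it deliberately ignores the reverse case (a longer n-gram containing one that is already picked).

```c
#include <string.h>

/* Hypothetical candidate record: an n-gram and how often it was seen. */
struct cand { const char *gram; long count; int picked; };

/* Estimated bytes saved if every remaining occurrence becomes one code byte. */
static long savings(const struct cand *c)
{
    return c->count * ((long)strlen(c->gram) - 1);
}

/* Greedily pick up to k n-grams. Stores the chosen indices in out[]
 * and returns how many were picked. Modifies the counts in c[]. */
static int pick_greedy(struct cand *c, int n, int *out, int k)
{
    int picked = 0;
    while (picked < k) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!c[i].picked && c[i].count > 0 &&
                (best < 0 || savings(&c[i]) > savings(&c[best])))
                best = i;
        if (best < 0 || savings(&c[best]) <= 0)
            break; /* nothing left that saves any bytes */
        c[best].picked = 1;
        out[picked++] = best;
        /* Shorter n-grams inside the picked one are already covered
         * wherever the picked one occurs, so discount their counts. */
        for (int i = 0; i < n; i++)
            if (!c[i].picked && strstr(c[best].gram, c[i].gram)) {
                c[i].count -= c[best].count;
                if (c[i].count < 0)
                    c[i].count = 0;
            }
    }
    return picked;
}
```

You would fill an array of `struct cand` from the n-gram CSV counts and call `pick_greedy(c, n, out, k)` with `k` set to the number of free codebook slots. A real comparison would still need compressing an actual corpus with each candidate codebook.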

I want to test it, but couldn't find any docs. So what are these characters?

static char *Smaz_cb[241] = {
"\002s,\266", "\003had\232\002leW", "\003on \216", "", "\001yS",
"\002ma\255\002li\227", "\003or \260", "", "\002ll\230\003s t\277",
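For anyone else puzzling over the excerpt above: my unconfirmed reading is that each string is one hash bucket holding zero or more entries back to back. Each entry would be a length byte, then that many bytes of text, then one byte giving the code emitted for that text. Under that assumption, "\003had\232\002leW" would hold "had" with code 0232 (octal) and "le" with code 'W'. The sketch below walks a bucket that way. `struct cb_entry` and `parse_bucket` are hypothetical names, not part of smaz.

```c
#include <string.h>

/* Hypothetical decoded entry: the text and the code byte that follows it. */
struct cb_entry { char text[8]; unsigned char code; };

/* Walk one bucket string, ASSUMING the layout
 *   [len][len bytes of text][code byte] ... terminated by a 0 byte.
 * Fills up to max entries and returns how many were read. */
static int parse_bucket(const char *slot, struct cb_entry *out, int max)
{
    int n = 0;
    while (slot[0] != 0 && n < max) {
        int len = (unsigned char)slot[0];
        if (len >= (int)sizeof(out[n].text))
            break; /* longer than this sketch expects; stop rather than overflow */
        memcpy(out[n].text, slot + 1, (size_t)len);
        out[n].text[len] = '\0';
        out[n].code = (unsigned char)slot[len + 1];
        slot += len + 2; /* skip length byte, text, and code byte */
        n++;
    }
    return n;
}
```

An empty string "" would then simply be an empty bucket, which would explain the blank entries in the table.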

wis, Jun 25 '18 11:06