use nltk for frequency analysis, refactoring and for clear text brute force?

Open gogo2464 opened this issue 2 years ago • 9 comments

The frequency analysis values are currently hardcoded in cryptanalib/frequency.py.

I think that if we use nltk instead of hard-coding, we will:
- gain in code readability
- be able to get frequencies for more languages
- but also be able to compare brute-forced decrypted text to check whether the cleartext corresponds to an existing language

I need your opinion. If you agree, I can implement it.

gogo2464 avatar Mar 26 '22 20:03 gogo2464

- but also be able to select a specific corpus with a specific theme when we know the type of document to decipher
- select a corpus from publicly available books published by the victim (or the CTF maker lol!) or found with OSINT

gogo2464 avatar Mar 26 '22 21:03 gogo2464

We currently do not really know where this data comes from. This is not easy to maintain.

gogo2464 avatar Mar 26 '22 21:03 gogo2464

Thanks for your comment! Having support for alternate languages is definitely a goal; an issue already exists for that at #31.

However, I'd prefer not to add another dependency, especially one which relies on native code objects, since it limits the ways FD/CA can be used. Every additional dependency also means that the project may potentially break if the dependency breaks, or changes its API, making the code harder to maintain.

I have some comments on the proposed benefits:

gain in code readability

It's true that hard-coding frequency distributions is not ideal, but surely even nltk hard-codes that data somewhere, because the alternative is to re-generate the distribution data at runtime and hard-code in the corpus, which is far worse from the perspective of resources used for both storage and startup.

be able to get frequencies for more languages, but also be able to select a specific corpus with a specific theme when we know the type of document to decipher, select a corpus from publicly available books published by the victim (or the CTF maker lol!) or found with OSINT

There is currently a poorly-advertised script at util/generate_frequency_tables.py that consumes a file and produces a frequency distribution suitable for use with the functions in cryptanalib. It could really use some work, though.
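For illustration, turning a corpus into a character-frequency dict could look roughly like the sketch below. This is not the code in util/generate_frequency_tables.py; the function names `build_frequency_table` and `build_frequency_table_from_file`, the lowercasing step, and the use of Python 3 are all assumptions made for the example.

```python
from collections import Counter


def build_frequency_table(text):
    """Return a dict mapping each character to its relative frequency.

    Hypothetical helper for illustration; not part of cryptanalib.
    """
    # Assumption: fold case so 'A' and 'a' are counted as one symbol.
    counts = Counter(text.lower())
    total = sum(counts.values())
    if total == 0:
        # An empty corpus has no meaningful distribution.
        return {}
    return {char: count / total for char, count in counts.items()}


def build_frequency_table_from_file(path, encoding="utf-8"):
    """Read a corpus file and build its character-frequency table."""
    with open(path, encoding=encoding) as handle:
        return build_frequency_table(handle.read())
```

Whether the resulting dict matches the exact format the cryptanalib functions expect would need to be checked against the existing tables.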

We currently do not really know where this data comes from. This is not easy to maintain.

A fair criticism. The English language data is based off of Charles Dickens' A Tale of Two Cities. I'm sure there is a better corpus that could be used, though I don't understand how this makes the project harder to maintain.

but also be able to compare brute-forced decrypted text to check whether the cleartext corresponds to an existing language

There is already a function called detect_plaintext() which does this; it is used in many parts of the existing code to enable many of the implemented attacks, such as the single-byte-xor cipher solver. However, it must be fed a particular distribution dict; it does not attempt to identify which distribution out of many the text most closely matches. That would be a nice feature.
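As a rough sketch of that missing feature, one could score a candidate text against several frequency dicts and keep the nearest one. The names `distribution_distance` and `closest_distribution` are hypothetical, the distance used here (sum of absolute differences) is just one simple choice, and none of this reflects how detect_plaintext() is actually implemented.

```python
from collections import Counter


def distribution_distance(text, table):
    """Sum of absolute differences between observed and expected frequencies.

    Hypothetical scoring function; lower means a closer match.
    """
    counts = Counter(text.lower())
    total = sum(counts.values())
    if total == 0:
        # Empty input cannot match any distribution.
        return float("inf")
    # Compare over every character seen in either the text or the table.
    chars = set(counts) | set(table)
    return sum(abs(counts.get(c, 0) / total - table.get(c, 0.0)) for c in chars)


def closest_distribution(text, tables):
    """Return the name of the table in `tables` that is nearest to `text`.

    `tables` maps a label (e.g. a language name) to a frequency dict.
    Raises ValueError if `tables` is empty.
    """
    return min(tables, key=lambda name: distribution_distance(text, tables[name]))
```

With tables keyed by language name, the returned label would be a guess at the language of the decrypted text. Short inputs give noisy results with any such metric.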

To me, it seems like effort would be better placed toward generating more frequency distributions, improving the built-in tool for generating frequency data, and, for better readability, changing the frequency module to dynamically load all files from a frequency_tables directory or some such, so users can generate their own frequency data and drop it in easily. It should also be better documented.
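A minimal sketch of the drop-in directory idea, assuming each table is stored as a JSON file mapping characters to frequencies and that the file name (minus extension) is the table's label. The JSON format, the `load_frequency_tables` name, and the Python 3 `pathlib` usage are assumptions for illustration, not existing project behavior.

```python
import json
from pathlib import Path


def load_frequency_tables(directory):
    """Load every *.json file in `directory` as a frequency table.

    Hypothetical loader: returns a dict mapping the file stem
    (e.g. 'english' for english.json) to the parsed frequency dict.
    """
    tables = {}
    # Sort for a deterministic load order across platforms.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            tables[path.stem] = json.load(handle)
    return tables
```

Under this layout, adding a language would only require dropping a new file into the directory, with no code change.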

I'm still willing to be convinced otherwise, though.

unicornsasfuel avatar Mar 27 '22 14:03 unicornsasfuel

A fair criticism. The English language data is based off of Charles Dickens' A Tale of Two Cities. I'm sure there is a better corpus that could be used, though I don't understand how this makes the project harder to maintain.

Interesting. We should maybe just document it somewhere.

gogo2464 avatar Mar 27 '22 18:03 gogo2464

- nltk could actually be very, very cool for word analysis in different languages, not just characters.
- It could also select a specific category of text to get frequencies from, like novels, sci-fi, lore, etc.

gogo2464 avatar Mar 27 '22 18:03 gogo2464

In my opinion, the best thing to do is to:
- set the hard-coded default values from nltk
- allow people to generate their own frequencies from nltk, to optionally replace the old ones with their own custom corpus
- also add more words from more languages to the hard-coded database

gogo2464 avatar Mar 27 '22 18:03 gogo2464

I would welcome any of these improvements if they could be made without adding nltk as a dependency.

unicornsasfuel avatar Mar 27 '22 18:03 unicornsasfuel

@unicornsasfuel understood! I will generate more hard-coded stats in more languages without nltk.

Just in case you want to take a look, I made a repo that uses nltk, for comparison. Look at https://github.com/gogo2464/cryptatools/blob/master/cryptalib/frequency.py

gogo2464 avatar Mar 27 '22 19:03 gogo2464

@unicornsasfuel Hello. I am very sorry. Since last time, I have started a rewrite, only slightly inspired by featherduster, with a very different philosophy.

I do not have time for this PR anymore. Sorry. I may finish it when I have the time.

I may use my own repo to generate the data and hardcode it in your repo if I find the time.

Sorry for the inconvenience.

gogo2464 avatar Nov 10 '22 14:11 gogo2464