android-malware-analysis
This project seeks to apply machine learning algorithms to Android malware classification.
Getting an API Key
AndroTotal has simplified the process for getting an API key. Log in or create an account at http://andrototal.org/, then open your profile settings. The API tab there contains your key.
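If the key is kept in config.ini (the file created from config-template.ini in the steps below), a script could read it roughly as follows. This is only a sketch: the section name "AndroTotal" and the option name "api_key" are hypothetical placeholders, so check config-template.ini for the names the scripts actually expect.

```python
import configparser


def load_api_key(path="config.ini", section="AndroTotal", option="api_key"):
    """Return the API key stored in an INI file.

    The default section and option names are hypothetical placeholders,
    not names confirmed by this repository.
    """
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse;
    # an empty list means the file is missing or unreadable.
    if not parser.read(path):
        raise FileNotFoundError(f"{path} not found; copy config-template.ini first")
    # fallback="" covers both a missing section and a missing option.
    key = parser.get(section, option, fallback="").strip()
    if not key:
        raise ValueError(f"no '{option}' set in section [{section}] of {path}")
    return key
```

Failing early with a clear message is meant to catch a missing or empty key before any long-running download or sorting job starts.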
This repository contains a set of scripts to automate the process of gathering data from malware samples, training a machine learning model on that data, and plotting its classification accuracy.
- Make a copy of config-template.ini called config.ini and edit it.
- Ensure that the "tools" subdirectory has been initialized:
  $ git submodule update --init tools
- Either use get_samples.py to download samples or copy them into "all_apks" from another source. If you're using get_samples.py, you can monitor it in another shell by running:
  $ watch "ls -l *.apk | wc -l"
- sort_malicious.py uses andrototal.org to sort them into "malicious_apk" and "benign_apk" folders. You can monitor it in another shell by running:
  $ watch "ls -l benign_apk/*.apk | wc -l && ls -l malicious_apk/*.apk | wc -l"
- extract_apks_parallel.sh unpacks the .apk files into folders and processes some of the data therein. You can monitor it in another shell by running:
  $ watch "wc -l benign_apk/valid_apks.txt; wc -l malicious_apk/valid_apks.txt"
- Run one of the following scripts to generate feature vectors:
  - parse_xml.py for permissions. "app_permission_vectors.json" is generated.
  - parse_maline_output.py for syscalls. "app_syscall_vectors.json" is generated. You will have to run maline first for this to work.
  - parse_disassembled.py for API calls. "app_method_vectors.json" is generated.
  - parse_ssdeep.py for fuzzy hashes. "app_hash_vectors.json" is generated. You will have to run ssdeep first for this to work.
  - combine_features.py for a combination of the top weighted features. "app_feature_vectors.json" is generated. This only works if you've previously trained a network on the specified features, and the feature weights files are named appropriately.
- Run
  $ run_trials.sh app_feature_vectors.json
  (or whichever json you want), which runs the tensorflow_learn.py script (where the ML happens) a number of times and puts the results into a folder. It also runs plot_data.py and match_features.py to create a plot and create a list of top weighted features, respectively.
- Change the parameters or input data and repeat step 6. It should be non-destructive so you can compare the results of different runs.
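To illustrate what a "feature vector" means in the steps above, here is a minimal sketch that turns per-app permission lists into fixed-length binary vectors. It is not the code in parse_xml.py, and the layout of "app_permission_vectors.json" is not described here, so the input mapping, the helper names, and the example file names are all assumptions made for illustration.

```python
import json


def build_vocabulary(apps):
    """Collect every feature name seen across all apps, in sorted order."""
    return sorted({feature for features in apps.values() for feature in features})


def to_binary_vector(features, vocabulary):
    """Return 1 for each vocabulary entry the app has, else 0."""
    present = set(features)
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical toy input: app file name -> list of requested permissions.
apps = {
    "benign_example.apk": ["android.permission.INTERNET"],
    "malicious_example.apk": [
        "android.permission.INTERNET",
        "android.permission.SEND_SMS",
    ],
}

vocabulary = build_vocabulary(apps)
vectors = {name: to_binary_vector(perms, vocabulary) for name, perms in apps.items()}
print(json.dumps(vectors, indent=2))
```

Sorting the vocabulary keeps every app's vector in the same column order, which is what lets a classifier treat position i as the same feature across all samples.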
Note: To use an SVM instead of a neural network, use sklearn_svm.py in place of tensorflow_learn.py. To use a decision tree, use sklearn_tree.py instead.
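Whichever learner is used, comparing runs usually comes down to summarizing the accuracy of each trial. The sketch below shows one way to do that with the standard library. It does not read the output of run_trials.sh, whose result format is not described here, and the accuracy numbers are made-up placeholders, not measurements from this project.

```python
from statistics import mean, stdev


def summarize_trials(accuracies):
    """Return (mean, sample standard deviation) for a list of accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two trials to estimate spread")
    return mean(accuracies), stdev(accuracies)


# Placeholder per-trial accuracies for two hypothetical runs.
runs = {
    "run_a": [0.90, 0.92, 0.91],
    "run_b": [0.88, 0.89, 0.87],
}

for name, accuracies in runs.items():
    avg, spread = summarize_trials(accuracies)
    print(f"{name}: mean={avg:.3f} stdev={spread:.3f}")
```

Reporting the spread alongside the mean helps show whether a difference between two runs is larger than the trial-to-trial noise.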