anomaly-detection
Anomaly detection using an autoencoder and DBScan or 3(5)-sigma (adapted for Kotlin source code anomaly detection)
Available steps (stages):
- autoencoding: run the autoencoder on the specified dataset, then calculate and write either the difference vectors (between input and output) or an array of simple Euclidean distances;
- anomaly_selection: read the difference vectors or the distances vector and select anomalies from it (via DBScan or 5-sigma);
- [no stage specified]: run both stages without writing the intermediate difference vectors or distances vector to a file.
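The stage selection above could be wired up roughly as follows. This is a minimal sketch, not the project's actual entry point: only the `-s` and `--use_dbscan` flags and the two stage names come from this README, while the `dest` name `stage` and the helper `stages_to_run` are hypothetical.

```python
import argparse

# Stage names as they appear in the usage examples of this README.
STAGES = ("autoencoding", "anomaly_selection")


def build_parser():
    # Hypothetical parser: only -s and --use_dbscan are taken from the README.
    parser = argparse.ArgumentParser(description="anomaly-detection (sketch)")
    parser.add_argument("-s", dest="stage", choices=STAGES, default=None)
    parser.add_argument("--use_dbscan", action="store_true")
    return parser


def stages_to_run(stage):
    # No stage given: run both stages back to back, keeping the
    # intermediate differences or distances in memory instead of in a file.
    return list(STAGES) if stage is None else [stage]


if __name__ == "__main__":
    args = build_parser().parse_args(["-s", "autoencoding"])
    print(stages_to_run(args.stage))
```

Leaving `-s` out makes `stage` default to `None`, which the sketch treats as "run everything".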
Program use
Autoencoding
At this stage, the vectors representing the AST are encoded and decoded, and the differences between the input vectors and the decoded vectors are written to a file.
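To make the two possible outputs of this stage concrete, here is a small sketch in plain Python. It assumes the difference is the signed element-wise difference between each input vector and its reconstruction; the function names are hypothetical, and the real tool may compute these values differently.

```python
import math


def difference_vectors(inputs, decoded):
    # Element-wise difference between each input vector and its reconstruction.
    return [[x - y for x, y in zip(row_in, row_out)]
            for row_in, row_out in zip(inputs, decoded)]


def euclidean_distances(inputs, decoded):
    # One scalar per sample: the L2 norm of its difference vector.
    return [math.sqrt(sum(d * d for d in diff))
            for diff in difference_vectors(inputs, decoded)]


if __name__ == "__main__":
    inputs = [[1.0, 2.0], [0.0, 0.0]]
    decoded = [[1.0, 2.0], [3.0, 4.0]]
    print(difference_vectors(inputs, decoded))
    print(euclidean_distances(inputs, decoded))
```

The full matrix keeps one row per sample and one column per feature, while the distances vector collapses each row to a single number, which is why it is much cheaper to store and process.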
Stage arguments
- -s -> autoencoding;
- --use_dbscan (default=false): whether to use DBScan (high memory or time usage!). If set, the full matrix of differences between the autoencoder input and output vectors is used; otherwise, the simple Euclidean distance between the input and output vectors is used;
- -f: path to the dataset file (CSV format with colon delimiter);
- --split_percent: dataset train/test split percent;
- --encoding_dim_percent: encoding dimension as a percentage of the number of features;
- --differences_output_file: path to the file with the input-decoded differences (full differences matrix if --use_dbscan=True, or simple distances vector if not).
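As an illustration of how the two "percent" arguments might translate into concrete sizes, consider the sketch below. It assumes both values are fractions between 0 and 1 (as in the usage examples) and that `--split_percent` is the training share; both helper names are hypothetical, and the tool's own rounding may differ.

```python
def split_sizes(n_samples, split_percent):
    # Number of training and test samples for a given split fraction
    # (assumption: split_percent is the training share).
    n_train = round(n_samples * split_percent)
    return n_train, n_samples - n_train


def encoding_dim(n_features, encoding_dim_percent):
    # Width of the autoencoder bottleneck as a fraction of the feature count,
    # never smaller than one unit.
    return max(1, round(n_features * encoding_dim_percent))


if __name__ == "__main__":
    print(split_sizes(1000, 0.9))
    print(encoding_dim(200, 0.8))
```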
Example of use
With DBScan:
python3 -s autoencoding --use_dbscan -f dataset.csv --split_percent 0.9 --encoding_dim_percent 0.8 --differences_output_file differences.bin
Without DBScan:
python3 -s autoencoding -f dataset.csv --split_percent 0.9 --encoding_dim_percent 0.8 --differences_output_file distances.json
If the full differences matrix is used (with the --use_dbscan option), the file is written in binary mode.
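The difference between the two file modes can be sketched as below. The serialization formats are assumptions made for illustration only: `pickle` stands in for whatever binary format the tool really uses for the matrix, and JSON is inferred from the `distances.json` file name in the example. All function names are hypothetical.

```python
import json
import os
import pickle
import tempfile


def write_differences(path, data, use_dbscan):
    # Full matrix: binary file; distances vector: JSON text file.
    if use_dbscan:
        with open(path, "wb") as f:
            pickle.dump(data, f)
    else:
        with open(path, "w") as f:
            json.dump(data, f)


def read_differences(path, use_dbscan):
    # The reader must use the same mode the writer used.
    if use_dbscan:
        with open(path, "rb") as f:
            return pickle.load(f)
    with open(path, "r") as f:
        return json.load(f)


def round_trip(data, use_dbscan):
    # Write to a temporary file and read it back.
    with tempfile.TemporaryDirectory() as tmp:
        name = "differences.bin" if use_dbscan else "distances.json"
        path = os.path.join(tmp, name)
        write_differences(path, data, use_dbscan)
        return read_differences(path, use_dbscan)
```

Whatever the real format is, the practical consequence is the same: the `--use_dbscan` setting must match between the stage that writes the file and the stage that reads it.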
Anomaly selection
At this stage, anomalies are selected from the difference matrix (via DBScan) or from the distances vector (via 5-sigma).
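The 5-sigma rule on the distances vector can be sketched in a few lines. This version assumes a one-sided test (only unusually large reconstruction distances count) and the population standard deviation; the function name and the `n_sigma` parameter are hypothetical.

```python
import statistics


def sigma_anomalies(distances, n_sigma=5.0):
    # Flag samples whose distance exceeds mean + n_sigma * standard deviation.
    mean = statistics.mean(distances)
    std = statistics.pstdev(distances)
    threshold = mean + n_sigma * std
    return [i for i, d in enumerate(distances) if d > threshold]


if __name__ == "__main__":
    # 99 well-reconstructed samples and one badly reconstructed one.
    distances = [1.0] * 50 + [1.2] * 49 + [40.0]
    print(sigma_anomalies(distances))
```

With the population standard deviation, no point can lie more than sqrt(n - 1) deviations from the mean, so a 5-sigma rule of this form needs at least 27 samples before it can flag anything.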
Stage arguments
- -s -> anomaly_selection;
- --use_dbscan (default=false): whether to use DBScan (high memory or time usage!). If set, the full matrix of differences between the autoencoder input and output vectors is used; otherwise, the simple Euclidean distance between the input and output vectors is used;
- --differences_file: path to the file with the input-decoded differences (full differences matrix or simple distances vector), obtained from the previous stage (autoencoding);
- --files_map_file: file mapping dataset indexes to AST file paths, obtained from ast-set2matrix with stage=vectors2matrix;
- --anomalies_output_file: path to the file that will contain the ranked anomaly list (as paths to AST code snippets and ranks).
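How the files map and the ranked output fit together might look like the sketch below. The layouts are assumptions: the files map is taken to be a JSON object from dataset index (as a string key) to AST file path, and each output entry carries a rank, a path, and a score. The real files produced by ast-set2matrix and by this tool may be structured differently.

```python
import json


def rank_anomalies(anomaly_indexes, scores, files_map):
    # Sort flagged samples by score, most anomalous first, and attach paths.
    ordered = sorted(anomaly_indexes, key=lambda i: scores[i], reverse=True)
    return [{"rank": rank, "file": files_map[str(i)], "score": scores[i]}
            for rank, i in enumerate(ordered, start=1)]


if __name__ == "__main__":
    scores = [0.1, 9.0, 0.2, 12.5]
    # Hypothetical files map contents.
    files_map = {"0": "a.json", "1": "b.json", "2": "c.json", "3": "d.json"}
    print(json.dumps(rank_anomalies([1, 3], scores, files_map), indent=2))
```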
Example of use
With DBScan:
python3 -s anomaly_selection --use_dbscan --differences_file differences.bin --files_map_file files_map.json --anomalies_output_file anomalies.json
Without DBScan:
python3 -s anomaly_selection --differences_file distances.json --files_map_file files_map.json --anomalies_output_file anomalies.json
If the full differences matrix is used (with the --use_dbscan option), the file is read in binary mode.
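For the DBSCAN path of this stage, one common approach is to cluster the rows of the differences matrix and treat the points labelled as noise as anomaly candidates. The sketch below is a minimal pure-Python DBSCAN written for illustration; whether the tool uses noise points this way, and which `eps` and `min_samples` values it uses, are assumptions. The pairwise distance scan is quadratic in the number of samples, which is consistent with the memory and time warning on `--use_dbscan`.

```python
import math

NOISE = -1


def _distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    n = len(points)

    def neighbours(i):
        # Indexes of all points within eps of point i (including i itself).
        return [j for j in range(n) if _distance(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # core point: keep expanding
    return labels


def dbscan_anomalies(differences, eps, min_samples):
    # Assumption: samples left as noise are reported as anomalies.
    labels = dbscan(differences, eps, min_samples)
    return [i for i, label in enumerate(labels) if label == NOISE]
```

For example, with four difference vectors close to the origin and one far away, `dbscan_anomalies(points, eps=0.5, min_samples=3)` returns only the index of the distant vector.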
Without specifying a stage
If you do not specify a stage, both stages are run.
Use the arguments of both stages except --differences_output_file and --differences_file.
Example of use
With DBScan:
python3 --use_dbscan -f dataset.csv --files_map_file files_map.json --split_percent 0.9 --encoding_dim_percent 0.8 --anomalies_output_file anomalies.json
Without DBScan:
python3 -f dataset.csv --files_map_file files_map.json --split_percent 0.9 --encoding_dim_percent 0.8 --anomalies_output_file anomalies.json