
Evaluation of a source-code authorship attribution tool

Authorship detection

The goal of this project is to evaluate a code2vec-based approach to authorship identification of source code, and to explore and address issues in the existing datasets for source code authorship attribution.

Papers used for comparison

Datasets

  • Google Code Jam submissions, C/C++/Python
  • 40 authors, Java
  • Projects mined from GitHub with a new data collection approach

The Java, C++, and Python datasets are also available here.

Project structure

The data extraction pipeline consists of two modules: Gitminer, written in Python, and Pathminer, written in Kotlin.

  • Gitminer processes the history of a Git repository to extract all blobs containing Java code.
  • Pathminer uses GumTree to parse Java code and track method changes through the repository's history.
  • To extract data from GitHub projects, store the names and links of the projects in projects and git_projects, respectively. Then go to the runner directory and run run.py (see the sketch after this list).
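
A minimal sketch of this step, assuming projects and git_projects are plain-text files in the runner directory with one entry per line; their exact location and format are not documented here, so treat both as assumptions:

import subprocess
from pathlib import Path

runner_dir = Path("runner")

# Hypothetical entries; replace with the real project names and clone links.
(runner_dir / "projects").write_text("intellij-community\n")
(runner_dir / "git_projects").write_text("https://github.com/JetBrains/intellij-community\n")

# run.py drives Gitminer and Pathminer over the listed projects.
subprocess.run(["python", "run.py"], cwd=runner_dir, check=True)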

Models and all the code for training and evaluation are located in the authorship_pipeline directory. To run experiments:

  • Create a configuration file manually (for examples, see the configs directory) or edit and run generate_configs.py.
  • Run python run_classification.py configs/path/to/your/config.yaml
  • To draw graphs for the evaluation on your project, run draw_graphs.py --project your_project_name (an end-to-end sketch follows this list).
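
The sketch below shows one possible way to script these steps. The configuration keys (dataset, classifier, n_folds) and paths are illustrative assumptions only; refer to the configs directory for the actual schema:

import subprocess
import yaml

# Illustrative config only: the real keys are defined by the files in configs/.
config = {
    "dataset": "processed/my_project/java",  # hypothetical path to extracted data
    "classifier": "PbRF",                    # e.g. PbNN or PbRF
    "n_folds": 10,
}

with open("configs/my_experiment.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Commands documented above: train/evaluate, then plot the results.
subprocess.run(["python", "run_classification.py", "configs/my_experiment.yaml"], check=True)
subprocess.run(["python", "draw_graphs.py", "--project", "my_project"], check=True)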

To run cross-validation on new data:

  • Put the source code files in datasets/datasetName/{author}/{files}, making sure all files of each author are in a single directory (see the layout sketch after this list).
  • Run data extraction to mine path-contexts from the source files:
java -jar attribution/pathminer/extract-path-contexts.jar snapshot \
    --project datasets/datasetName/ \
    --output processed/datasetName/ \
    --java-parser antlr \
    --maxContexts 2000 --maxH 8 --maxW 3
  • Depending on the language, the extracted data will be in the processed/datasetName/{c,cpp,java,py} folder.
  • To run cross-validation, create a configuration file (e.g., PbNN or PbRF) and run python -m run_classification path/to/config in the attribution/authorship_pipeline folder.
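
A sketch of preparing such a dataset and mining path-contexts from it; the author names, source file paths, and dataset name are placeholders, while the extraction command mirrors the one shown above:

import shutil
import subprocess
from pathlib import Path

dataset = Path("datasets/myDataset")

# Hypothetical mapping from author to their source files.
author_files = {
    "alice": ["/tmp/alice/Main.java"],
    "bob": ["/tmp/bob/Parser.java"],
}

# Lay files out as datasets/datasetName/{author}/{files}, one directory per author.
for author, files in author_files.items():
    author_dir = dataset / author
    author_dir.mkdir(parents=True, exist_ok=True)
    for path in files:
        shutil.copy(path, author_dir)

# Mine path-contexts with the same command as documented above.
subprocess.run([
    "java", "-jar", "attribution/pathminer/extract-path-contexts.jar", "snapshot",
    "--project", "datasets/myDataset/",
    "--output", "processed/myDataset/",
    "--java-parser", "antlr",
    "--maxContexts", "2000", "--maxH", "8", "--maxW", "3",
], check=True)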

Results

IntelliJ Community

44 developers, 2000 to 10000 samples each (context separation)

[Figure: evaluation on IntelliJ IDEA]

44 developers, 2000 to 10000 samples each (time separation)

PbNN

[Figure: evaluation on IntelliJ IDEA]

PbRF

[Figure: evaluation on IntelliJ IDEA]

JCaliskan

[Figure: evaluation on IntelliJ IDEA]

21 developers, at least 10000 samples each (context separation)

[Figure: evaluation on IntelliJ IDEA]

21 developers, at least 10000 samples each (time separation)

PbNN

[Figure: evaluation on IntelliJ IDEA]

PbRF

[Figure: evaluation on IntelliJ IDEA]

JCaliskan

[Figure: evaluation on IntelliJ IDEA]

Gradle

28 developers, at least 500 samples each (context separation)

[Figure: evaluation on Gradle]

28 developers, at least 500 samples each (time separation)

PbNN

[Figure: evaluation on Gradle]

PbRF

[Figure: evaluation on Gradle]

JCaliskan

[Figure: evaluation on Gradle]

Mule

16 developers, 1000 to 5000 samples each (context separation)

[Figure: evaluation on Mule]

7 developers, at least 5000 samples each (context separation)

[Figure: evaluation on Mule]