
Evaluation of a source-code authorship attribution tool

Authorship detection

The goal of this project is to evaluate a code2vec-based approach to authorship identification of source code, and to explore and address issues in the existing datasets for source code authorship attribution.

Papers used for comparison

Datasets

  • Google Code Jam submissions, C/C++/Python
  • 40 authors, Java
  • Projects mined from GitHub with a new data collection approach

The Java, C++, and Python datasets are also available here.

Project structure

The data extraction pipeline consists of two modules: Gitminer, written in Python, and Pathminer, written in Kotlin.

  • Gitminer processes the history of a Git repository to extract all blobs containing Java code.
  • Pathminer uses GumTree to parse Java code and track method changes through the repository's history.
  • To extract data from GitHub projects, store the names and links of the projects in projects and git_projects, respectively. Then go to the runner directory and run run.py (see the sketch after this list).
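
A minimal sketch of this step, assuming projects and git_projects are plain-text files in the runner directory with one entry per line; their exact location and format are not documented here, so treat both as assumptions:

import subprocess
from pathlib import Path

runner_dir = Path("runner")

# Hypothetical entries; replace with the real project names and clone links.
(runner_dir / "projects").write_text("intellij-community\n")
(runner_dir / "git_projects").write_text("https://github.com/JetBrains/intellij-community\n")

# run.py drives Gitminer and Pathminer over the listed projects.
subprocess.run(["python", "run.py"], cwd=runner_dir, check=True)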

Models and all the code for training and evaluation are located in the authorship_pipeline directory. To run experiments:

  • Create a configuration file manually (for examples, see the configs directory) or edit and run generate_configs.py.
  • Run python run_classification.py configs/path/to/your/config.yaml
  • To draw graphs for the evaluation on your project, run draw_graphs.py --project your_project_name (an end-to-end sketch follows this list).
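
The sketch below shows one possible way to script these steps. The configuration keys (dataset, classifier, n_folds) and paths are illustrative assumptions only; refer to the configs directory for the actual schema:

import subprocess
import yaml

# Illustrative config only: the real keys are defined by the files in configs/.
config = {
    "dataset": "processed/my_project/java",  # hypothetical path to extracted data
    "classifier": "PbRF",                    # e.g. PbNN or PbRF
    "n_folds": 10,
}

with open("configs/my_experiment.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Commands documented above: train/evaluate, then plot the results.
subprocess.run(["python", "run_classification.py", "configs/my_experiment.yaml"], check=True)
subprocess.run(["python", "draw_graphs.py", "--project", "my_project"], check=True)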

To run cross-validation on new data:

  • Put the source code files in datasets/datasetName/{author}/{files}, making sure all files of each author are in a single directory (see the layout sketch after this list).
  • Run data extraction to mine path-contexts from the source files:
java -jar attribution/pathminer/extract-path-contexts.jar snapshot \
    --project datasets/datasetName/ \
    --output processed/datasetName/ \
    --java-parser antlr \
    --maxContexts 2000 --maxH 8 --maxW 3
  • Depending on the language, the extracted data will be in the processed/datasetName/{c,cpp,java,py} folder.
  • To run cross-validation, create a configuration file (e.g., PbNN or PbRF) and run python -m run_classification path/to/config in the attribution/authorship_pipeline folder.
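
A sketch of preparing such a dataset and mining path-contexts from it; the author names, source file paths, and dataset name are placeholders, while the extraction command mirrors the one shown above:

import shutil
import subprocess
from pathlib import Path

dataset = Path("datasets/myDataset")

# Hypothetical mapping from author to their source files.
author_files = {
    "alice": ["/tmp/alice/Main.java"],
    "bob": ["/tmp/bob/Parser.java"],
}

# Lay files out as datasets/datasetName/{author}/{files}, one directory per author.
for author, files in author_files.items():
    author_dir = dataset / author
    author_dir.mkdir(parents=True, exist_ok=True)
    for path in files:
        shutil.copy(path, author_dir)

# Mine path-contexts with the same command as documented above.
subprocess.run([
    "java", "-jar", "attribution/pathminer/extract-path-contexts.jar", "snapshot",
    "--project", "datasets/myDataset/",
    "--output", "processed/myDataset/",
    "--java-parser", "antlr",
    "--maxContexts", "2000", "--maxH", "8", "--maxW", "3",
], check=True)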

Results

IntelliJ Community

44 developers, 2000 to 10000 samples each (context separation)

[Figure: evaluation on IntelliJ IDEA]

44 developers, 2000 to 10000 samples each (time separation)

PbNN

[Figure: evaluation on IntelliJ IDEA]

PbRF

[Figure: evaluation on IntelliJ IDEA]

JCaliskan

[Figure: evaluation on IntelliJ IDEA]

21 developers, at least 10000 samples each (context separation)

[Figure: evaluation on IntelliJ IDEA]

21 developers, at least 10000 samples each (time separation)

PbNN

[Figure: evaluation on IntelliJ IDEA]

PbRF

[Figure: evaluation on IntelliJ IDEA]

JCaliskan

[Figure: evaluation on IntelliJ IDEA]

Gradle

28 developers, at least 500 samples each (context separation)

[Figure: evaluation on Gradle]

28 developers, at least 500 samples each (time separation)

PbNN

[Figure: evaluation on Gradle]

PbRF

[Figure: evaluation on Gradle]

JCaliskan

[Figure: evaluation on Gradle]

Mule

16 developers, 1000 to 5000 samples each (context separation)

[Figure: evaluation on Mule]

7 developers, at least 5000 samples each (context separation)

[Figure: evaluation on Mule]