Feat(models): Implemented three models for license similarity

Open Kaushl2208 opened this issue 4 years ago • 2 comments

Description

This PR implements Logistic Regression, Multinomial Naive Bayes, and Linear SVC classifiers trained on the license dataset licenseList.csv. The goal is to work toward a model that can make atarashi faster and more accurate.

Files

  • train.py (trains the models and saves them as binary files)
  • test.py (for testing purposes)
  • lr_model.pkl (binary file for Logistic Regression)
  • nb_model.pkl (binary file for Multinomial Naive Bayes)
  • svc_model.pkl (binary file for Linear SVC)
  • vectorizer.pkl (binary file storing the vocabulary)
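
As a rough illustration of how the files above relate to each other, here is a minimal sketch of the train, save, and reload cycle. It is not the PR's actual train.py: the TF-IDF vectorizer, the toy in-memory texts standing in for licenseList.csv, the temporary output directory, and the `classify` helper are all assumptions made for this example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in for licenseList.csv (assumption: the real columns are not shown here).
texts = [
    "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "Permission is hereby granted free of charge to any person using this software",
    "This program is free software; you can redistribute it under the GNU General Public License",
    "You can redistribute this program under the terms of the GNU General Public License",
    "Licensed under the Apache License, Version 2.0; you may not use this file except in compliance",
    "You may obtain a copy of the Apache License, Version 2.0 at the following address",
]
labels = ["MIT", "MIT", "GPL-2.0", "GPL-2.0", "Apache-2.0", "Apache-2.0"]

# Assumption: a TF-IDF vectorizer; the PR only says the vectorizer stores the vocabulary.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# One estimator per binary file named in the list above.
models = {
    "lr_model.pkl": LogisticRegression(max_iter=1000),
    "nb_model.pkl": MultinomialNB(),
    "svc_model.pkl": LinearSVC(),
}

# Write to a temporary directory so the sketch does not touch the repository.
out_dir = Path(tempfile.mkdtemp())
for filename, model in models.items():
    model.fit(features, labels)
    with open(out_dir / filename, "wb") as handle:
        pickle.dump(model, handle)
with open(out_dir / "vectorizer.pkl", "wb") as handle:
    pickle.dump(vectorizer, handle)


def classify(text, model_file):
    """Hypothetical helper: reload the vectorizer and one model, then predict a label."""
    with open(out_dir / "vectorizer.pkl", "rb") as handle:
        loaded_vectorizer = pickle.load(handle)
    with open(out_dir / model_file, "rb") as handle:
        loaded_model = pickle.load(handle)
    return str(loaded_model.predict(loaded_vectorizer.transform([text]))[0])


prediction = classify("Licensed under the Apache License, Version 2.0", "svc_model.pkl")
print(prediction)
```

The vectorizer has to be saved alongside the models because each classifier only understands feature vectors built from the same vocabulary it was trained on.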

How to use?

  • Test the models

    • atarashi -a lr_classifier path/to/file (Logistic Regression)
    • atarashi -a nb_classifier path/to/file (Multinomial Naive Bayes)
    • atarashi -a svc_classifier path/to/file (Linear SVC)
  • Train the models (Optional)

    • From the base folder, run: python3 atarashi/agents/models/train.py

ToDo

  • [x] Test that the algorithms work and measure their accuracy using evaluator.py

  • [x] Proper integration with atarashii.py

Accuracy Score

Model Name              | Accuracy Score (%) | Time taken on 100 files (sec)
Logistic Regression     | 31                 | 88.6
Linear SVC              | 36                 | 79.4
Multinomial Naive Bayes | 30                 | 83.72
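
To make the two measured quantities concrete, the following is a minimal sketch of how an accuracy percentage and a wall-clock time could be computed for any classifier. It is not the project's evaluator.py: the `evaluate` function, the `keyword_predict` stand-in, and the sample texts are all hypothetical and exist only for illustration.

```python
import time


def evaluate(predict, samples):
    """Return (accuracy in percent, elapsed seconds) for a predict callable.

    `samples` is a list of (text, expected_label) pairs.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    start = time.perf_counter()
    correct = sum(1 for text, expected in samples if predict(text) == expected)
    elapsed = time.perf_counter() - start
    return 100.0 * correct / len(samples), elapsed


def keyword_predict(text):
    # Hypothetical stand-in for a trained model: a single keyword rule.
    return "MIT" if "Permission is hereby granted" in text else "GPL-2.0"


samples = [
    ("Permission is hereby granted, free of charge", "MIT"),
    ("This program is free software under the GNU GPL", "GPL-2.0"),
    ("Licensed under the Apache License, Version 2.0", "Apache-2.0"),
    ("Permission is hereby granted to use this software", "MIT"),
]

accuracy, seconds = evaluate(keyword_predict, samples)
print(f"accuracy: {accuracy:.1f}% in {seconds:.4f}s")
```

Passing the predictor as a callable keeps the measurement identical across all three models, so differences in the numbers reflect the models rather than the harness.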

Future Scope

  • A better-defined dataset should improve the similarity accuracy further. By well-defined, I mean a dataset that also includes newly updated licenses and maps each class to several license files (1 class to n licenses).

CC: @hastagAB @GMishx @ag4ums

Signed off by: Kaushlendra Pratap Singh [email protected]

Kaushl2208 • Aug 11 '20 20:08

@hastagAB @GMishx, I added the model commands to atarashii.py, but it seems I am missing an update somewhere else in the code.

Kaushl2208 • Aug 11 '20 20:08

@GMishx @ag4ums I have run all three models on the test files and am attaching screenshots of the results.

SVC: [screenshot of results]

NB: [screenshot of results]

Logistic Regression: [screenshot of results]

Kaushl2208 • Aug 15 '20 14:08