spamfilter
DEPRECATED: Go to https://github.com/prodicus/spammy for DEV version
Unmaintained: please go to spammy for the latest development version, which is pip-installable.
spamfilter is our Machine Learning project, where we build a custom Naive Bayes classifier to classify email into ham or spam.
Trained on close to 33,000 training emails
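For readers new to the technique, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not the project's implementation: the class name `TinyNaiveBayes` and its `train`/`classify` methods are hypothetical and only illustrate the idea on pre-tokenised text.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Hypothetical illustration: multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()                       # every token seen during training

    def train(self, tokens, label):
        """Record one training document (a list of tokens) under `label`."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        """Return the label with the highest log-posterior for `tokens`."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior: share of training documents carrying this label
            score = math.log(n_docs / total_docs)
            # Add-one smoothing keeps unseen tokens from zeroing the product
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = TinyNaiveBayes()
nb.train(["win", "free", "prize"], "spam")
nb.train(["free", "money", "now"], "spam")
nb.train(["meeting", "at", "noon"], "ham")
nb.train(["lunch", "meeting", "tomorrow"], "ham")
print(nb.classify(["free", "prize"]))
```

Working in log space avoids floating point underflow when many small probabilities are multiplied together.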
Feature sets
- CAPSLOCK
- attachments
- numbers
- Links
- Words in text
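As a rough idea of how the feature sets above could be computed from an email body, here is a small sketch. The function `extract_features`, the regular expressions, and the feature names are assumptions made for illustration; the project's own extractor may work differently.

```python
import re
from collections import Counter

# Hypothetical patterns, chosen for illustration only
LINK_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
WORD_RE = re.compile(r"[A-Za-z']+")


def extract_features(text, has_attachment=False):
    """Map a raw email body to a dict of simple features."""
    links = LINK_RE.findall(text)
    # Remove links first so URL fragments are not counted as words or numbers
    body = LINK_RE.sub(" ", text)
    words = WORD_RE.findall(body)
    # CAPSLOCK: words of two or more characters written entirely in upper case
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    features = {
        "caps_words": len(caps),
        "has_attachment": bool(has_attachment),
        "numbers": len(NUMBER_RE.findall(body)),
        "links": len(links),
    }
    # Words in text: lower-cased bag-of-words counts
    for word, count in Counter(w.lower() for w in words).items():
        features["word:" + word] = count
    return features


sample = "WIN a FREE prize now! Visit http://example.com and call 555 1234"
print(extract_features(sample, has_attachment=True))
```

Whether an email has an attachment is passed in as a flag here, since detecting it requires parsing the full message rather than just the body text.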
You can use the pickled classifier objects to classify mail into spam or ham. (Refer to the DEMO and API usage guide for details.)

Index
- Development
  - Installing the dependencies
  - Downloading the NLTK corpora
  - Check whether you have everything set up
- Running the classifier
  - Loading the saved classifier
  - Manually running and training the classifier
- API usage
  - Custom NB classifier API
  - Textblob API
- Training classifier on your own dataset
- FAQ
  - Accuracy of the classifier
  - Regarding the dataset
- To the contributors
- Ideas
- Legal stuff
Development
:arrow_up: Back to top
Installing the dependencies
I prefer to use virtualenvs to keep the global Python interpreter clutter-free, but you are free to do a system-wide install of the dependencies.
$ git clone https://github.com/prodicus/spamfilter/ && cd spamfilter
$ pip install -r requirements.txt
Downloading the NLTK corpora
>>> import nltk
>>> nltk.download('stopwords')
Check whether you have everything set up
>>> from termcolor import colored
>>> import bs4
>>> from nltk.corpus import stopwords
>>> from nltk import stem
>>>
If the above imports work without giving you an error, you are good to go!
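If you would rather run the same check as a script, something along these lines should work. The function name `missing_modules` is made up for this sketch; it only tests that the top-level modules import, not that the stopwords corpus has been downloaded.

```python
import importlib

# Top-level modules behind the imports checked above
REQUIRED_MODULES = ["termcolor", "bs4", "nltk"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```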
Running the classifier
:arrow_up: Back to top
After installing the dependencies, make sure that you have make installed on your system.
Loading the saved classifier
A classifier object trained on the full_corpus dataset (close to 33,000 emails) can be loaded and used for classifying.
$ make pickle_run
Sit back and watch!
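If you want to load a pickled object yourself instead of going through make, the standard pickle module is enough. The sketch below is generic: the helper name `load_classifier` is hypothetical, and the pickle's file path and the methods on the loaded object depend on the project, so check the API usage guide before calling anything on it.

```python
import pickle


def load_classifier(path):
    """Load a pickled object from `path`.

    The class of the pickled object must be importable when this runs,
    otherwise pickle raises an error while rebuilding it.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.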
Manually running and training the classifier
$ make run
This asks you which dataset to train the classifier on and, once it is trained, which dataset to test it on.
NOTE: If you do not have make installed, run the scripts directly:
- $ python test.py for $ make run
- $ python test_classifier_pickle.py for $ make pickle_run
API usage
:arrow_up: Back to top
Custom NB classifier API
Refer to API usage for the custom classifier (wiki) for implementation details.
Textblob API
Refer to API usage for the textblob classifier (wiki) for implementation details.
Training classifier on your own dataset
:arrow_up: Back to top
You can train the classifier on your own dataset!
Step 1
Put your dataset folder (e.g. data_foo) inside the data folder. It should contain a ham and a spam subfolder:
$ tree data/data_foo/ -L 1
data/data_foo/
├── ham
└── spam
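To sanity-check that a dataset folder follows this layout before training, you could read it back with a few lines of Python. This is a sketch under the assumption that every file in ham/ and spam/ is one email; `load_dataset` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def load_dataset(root):
    """Return a list of (text, label) pairs read from root/ham and root/spam."""
    samples = []
    for label in ("ham", "spam"):
        for path in sorted((Path(root) / label).glob("*")):
            if path.is_file():
                # Email corpora often contain stray non-UTF-8 bytes; replace them
                text = path.read_text(encoding="utf-8", errors="replace")
                samples.append((text, label))
    return samples
```

For example, `len(load_dataset("data/data_foo"))` gives the number of emails found across both subfolders.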
Step 2
- Specify the folder name of your newly added dataset and the name of the pickle file to be created in the file create_pickle.py
- Choose the number of files to train the classifier against in the file create_pickle.py
Step 3
$ make pickle

FAQ
:arrow_up: Back to top
Accuracy of the classifier
I ran it one too many times, apparently, and the accuracy is generally in the following ranges:
| Class | Accuracy |
|---|---|
| Spam | 80 to 94% |
| Ham | 70 to 80% |
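How these figures were computed is not spelled out here. Assuming per-class accuracy means the percentage of emails of a given class that were labelled correctly, it can be computed as in the sketch below; `per_class_accuracy` is a hypothetical helper and the labels are made-up sample data.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: percentage of that label's emails classified correctly}."""
    totals, correct = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: 100.0 * correct.get(label, 0) / n for label, n in totals.items()}


# Made-up labels, purely to show the call
truth = ["spam", "spam", "spam", "spam", "ham", "ham"]
guess = ["spam", "spam", "spam", "ham", "ham", "spam"]
print(per_class_accuracy(truth, guess))
```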
Watch the classifier in action here
Regarding the dataset
The dataset used is the Enron dataset.
We trained our spam_classifier.pickle classifier object against the full_corpus dataset and then cross-validated the pickled classifier against any of the datasets present in the data directory.
Read more about the directory structure here
To the contributors
:arrow_up: Back to top
Refer to CONTRIBUTING.md for more details.
Ideas
:arrow_up: Back to top
- [ ] Deploying a full blown app to heroku
- [ ] ~~To make a voting system which will take the best out of all the classifiers (increasing the accuracy is the aim)~~
- [x] Try out textblob and see how it performs with our classifier
- [x] ~~To decide on whether to use clint or termcolor~~ Using colorama as explained in commit 89da4cd
- [ ] Try implementing some of the algorithms using scikit learn
Legal Stuff
:arrow_up: Back to top
Open sourced under GPLv3
spamfilter
Copyright (C) 2016 Tasdik Rahman([email protected])
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
You can find a copy of the LICENSE file HERE