sourceclassifier
sourceclassifier copied to clipboard
Use a Bayesian classifier to determine source code language
h1. SourceClassifier
Source classifier identifies programming language using a Bayesian classifier trained on a corpus generated from the "Computer Language Benchmarks Game":http://shootout.alioth.debian.org/ . It is written in Ruby and availabe as a gem. To train the classifier to identify new languages download the sources from github.
Out of the box SourceClassifier recognises Css, C, Java, Javascript, Perl, Php, Python and Ruby.
h2. Usage
First install the gem using github as a source
$ gem sources -a http://gems.github.com $ sudo gem install chrislo-sourceclassifier
Then, to use
require 'rubygems' require 'sourceclassifier' s = SourceClassifier.new ruby_text = int main() { write(1, "hello world\n", 12); return(0); } EOT s.identify(ruby_text) #=> Ruby s.identify(c_text) #=> Gcc
h2. Training
Download the sources from github and in the directory run the training rake test
$ rake train
In the ./sources directory are subdirectories for each language you wish to be able to identify. Each subdirectory contains examples of programs written in that language. The name of the directory is significant - it is the value returned by the SourceClassifier.identify() method.
The rake task populate:shootout can be used to build these subdirectories from a checkout of the "computer language shootout sources":http://alioth.debian.org/scm/?group_id=30402 but you are free to train the classifier using any available examples. Edit the Rakefile to point to your checkout of the shootout sources
Run rake populate:css to grab the css files used to train the classifier from csszengarden.com.
To populate the sources directory using all available sources run
$ rake populate:all
h2. Acknowledgments
This library depends heavily on the great "Classifier":http://classifier.rubyforge.org/ gem by Lucas Carlson and David Fayram II.
h2. License
This gem is released under the MIT license (see LICENSE). The training examples retain their original licenses.