Recommender system for GitHub projects using the GitHub Archive data

github-explorer

This project is a recommender system for GitHub, built from the GitHub Archive data using Amazon Elastic MapReduce, Hadoop, Apache Pig, Piggybank, DataFu and Jython User Defined Functions (UDFs).

Getting the GitHub Data

To fetch all data for the years 2011-2013 from Amazon S3, run get_all_data.rb. 404 errors are normal at the start of the download: the script begins iterating at the beginning of the year, but events only start in February.
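The 404s are easier to picture once the hourly file names are enumerated. The following is a minimal Python sketch, not the contents of get_all_data.rb. The `YYYY-MM-DD-H.json.gz` naming (with an unpadded hour) and the helper name `hourly_archive_names` are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def hourly_archive_names(first_year=2011, last_year=2013):
    """Yield one assumed archive file name per hour, e.g. '2011-01-01-0.json.gz'."""
    current = datetime(first_year, 1, 1, 0)
    end = datetime(last_year + 1, 1, 1, 0)
    while current < end:
        # Assumption: the hour is not zero-padded in archive file names.
        yield "{:%Y-%m-%d}-{}.json.gz".format(current, current.hour)
        current += timedelta(hours=1)


if __name__ == "__main__":
    names = list(hourly_archive_names())
    print(len(names), names[0], names[-1])
```

Any name generated for an hour before the archive's first event has no file behind it, which is why a downloader that walks this list from January 1 sees 404s first.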

ETL for Pig

To load the JSON in Pig, it must first be formatted as one object per line. To do this, run newline_format.rb. The script can be run locally in under a few minutes.
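As a rough illustration of the reformatting step, here is a Python sketch that rewrites a stream of JSON objects as one object per line. It assumes the raw text contains objects written back to back, optionally separated by whitespace; that input shape and the function names are assumptions, and newline_format.rb may work differently.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_one_per_line(text):
    """Re-serialize every object compactly, one per line."""
    return "\n".join(
        json.dumps(obj, separators=(",", ":")) for obj in iter_json_objects(text)
    )


if __name__ == "__main__":
    print(to_one_per_line('{"type": "WatchEvent"}{"type": "ForkEvent"}'))
```

Using a real JSON decoder rather than splitting on `}{` keeps braces inside string values from being mistaken for object boundaries.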

Splitting Events by Type

Run split_events.pig to split events by type into a dozen /tmp/<event_type> directories. Each event type can then be analyzed independently. In my experience, the data grew quickly once I started looking at social data, to the point that Hadoop became necessary.
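For small samples, the same split can be tried locally before moving to Pig. This Python sketch groups one-object-per-line JSON by event type. The use of a top-level `type` key and the function name are assumptions; it is not a translation of split_events.pig.

```python
import json
from collections import defaultdict


def split_events_by_type(lines):
    """Group parsed events by their 'type' field (assumed key)."""
    by_type = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        event = json.loads(line)
        by_type[event.get("type", "Unknown")].append(event)
    return dict(by_type)


if __name__ == "__main__":
    sample = ['{"type": "WatchEvent"}', '{"type": "ForkEvent"}', '{"type": "WatchEvent"}']
    for event_type, events in sorted(split_events_by_type(sample).items()):
        print(event_type, len(events))
```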

Building Recommendations

Next, we use Pig to create repository recommendations for all GitHub repositories.

Run recommend.pig, which uses udfs.py to calculate a Pearson distance between all repos using an inferred rating system. It is based on the example from Programming Collective Intelligence, scaled up for the size of the data (Pig, Hadoop, Elastic MapReduce). Pearson correlation coefficients are calculated between all users, and each user's recommendations are then weighted by this value.
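To make the weighting concrete, here is a small in-memory Python sketch in the spirit of the Programming Collective Intelligence example. It is not the code in udfs.py or recommend.pig. The function names, the `{user: {repo: rating}}` layout, and the choice to ignore users with non-positive similarity are assumptions.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the repos both users have rated."""
    shared = [repo for repo in ratings_a if repo in ratings_b]
    if not shared:
        return 0.0
    xs = [ratings_a[repo] for repo in shared]
    ys = [ratings_b[repo] for repo in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # No variance on either side means the correlation is undefined; use 0.
    return numerator / denominator if denominator else 0.0


def recommend(user, ratings):
    """Rank repos the user has not rated by similarity-weighted rating."""
    totals, similarity_sums = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = pearson(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # assumption: only positively correlated users count
        for repo, rating in other_ratings.items():
            if repo in ratings[user]:
                continue
            totals[repo] = totals.get(repo, 0.0) + similarity * rating
            similarity_sums[repo] = similarity_sums.get(repo, 0.0) + similarity
    ranked = [(totals[repo] / similarity_sums[repo], repo) for repo in totals]
    return sorted(ranked, reverse=True)


if __name__ == "__main__":
    ratings = {
        "alice": {"a/x": 1.0, "b/y": 3.0, "c/z": 4.0},
        "bob": {"a/x": 1.0, "b/y": 3.0, "c/z": 4.0, "d/w": 2.0},
    }
    print(recommend("alice", ratings))
```

Dividing by the sum of similarities keeps a repo rated by many users from outranking others purely on volume. Comparing every pair of users in memory like this is quadratic, which is the part the Pig job distributes.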

The rating system is as follows:

  • Watch Repository: 1.0
  • Download Repository: 1.0
  • Create Issue: 2.0
  • Fork Repository: 3.0
  • Create Repository: 4.0

Specifically, the code is:

/* Watch events happen whenever a user 'watches' a github project */
watch_events = LOAD 's3://github-explorer/WatchEvent' AS (json: map[]);
watch_ratings = FOREACH watch_events GENERATE (chararray)$0#'actor'#'login' AS follower:chararray,
                                              (chararray)$0#'repo'#'name' AS repo:chararray,
                                              1.0 AS rating;

/* Fork events happen whenever a github project is 'forked' */
fork_events = LOAD 's3://github-explorer/ForkEvent' AS (json: map[]);
fork_ratings = FOREACH fork_events GENERATE (chararray)$0#'actor'#'login' AS follower:chararray,
                                           (chararray)$0#'repo'#'name' AS repo:chararray,
                                           3.0 AS rating;

/* Download events, whenever a user downloads a tarball of a repo */
download_events = LOAD 's3://github-explorer/DownloadEvent' AS (json: map[]);
download_ratings = FOREACH download_events GENERATE (chararray)$0#'actor_attributes'#'login' AS follower:chararray,
                                                    StringConcat((chararray)$0#'repository'#'owner', '/', $0#'repository'#'name') AS repo:chararray,
                                                    1.0 AS rating;

/* Create issues events - implies a user has already downloaded/forked and tried the software */
issues_events = LOAD 's3://github-explorer/IssuesEvent' AS (json: map[]);
issues_ratings = FOREACH issues_events GENERATE (chararray)$0#'actor_attributes'#'login' AS follower:chararray,
                                                StringConcat((chararray)$0#'repository'#'owner', '/', $0#'repository'#'name') AS repo:chararray,
                                                2.0 AS rating;

/* Create repository event - strongest association with a repo possible */
create_events = LOAD 's3://github-explorer/CreateEvent' AS (json: map[]);
create_ratings = FOREACH create_events GENERATE (chararray)$0#'actor_attributes'#'login' AS follower:chararray,
                                                StringConcat((chararray)$0#'repository'#'owner', '/', $0#'repository'#'name') AS repo:chararray,
                                                4.0 AS rating;
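
For experimenting on a small sample outside Pig, the same mapping can be sketched in Python. The field paths mirror the Pig statements above. The top-level `type` key, the `RATINGS` table, and the `to_rating` helper are assumptions introduced for illustration.

```python
# Rating weights from the list above, keyed by event type.
RATINGS = {
    "WatchEvent": 1.0,
    "DownloadEvent": 1.0,
    "IssuesEvent": 2.0,
    "ForkEvent": 3.0,
    "CreateEvent": 4.0,
}


def to_rating(event):
    """Return (follower, repo, rating) for a parsed event, or None if unrated."""
    kind = event.get("type")  # assumption: the event type is stored under 'type'
    if kind not in RATINGS:
        return None
    if kind in ("WatchEvent", "ForkEvent"):
        # These events carry actor.login and a full repo.name.
        follower = event["actor"]["login"]
        repo = event["repo"]["name"]
    else:
        # The other events carry actor_attributes and a repository object.
        follower = event["actor_attributes"]["login"]
        repo = event["repository"]["owner"] + "/" + event["repository"]["name"]
    return (follower, repo, RATINGS[kind])


if __name__ == "__main__":
    sample = {
        "type": "IssuesEvent",
        "actor_attributes": {"login": "someone"},
        "repository": {"owner": "example", "name": "project"},
    }
    print(to_rating(sample))
```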

Publishing Recommendations

Next, run load_mongo.pig, which loads the recommendations into MongoDB. You'll need to edit the paths and hostname for your MongoDB server. Then create the following index in MongoDB:

mongo recommendations
> db.recommendations.ensureIndex({login: 1})

Running the Application

Loren ipsum

Bibliography

  • Programming Collective Intelligence, page 7
  • Recommender Systems: An Introduction, page 18
  • Machine Learning in Action, Ch. 14