example_dataproc_twitter
Implementation of a Twitter Based Recommender System Fully Serverless Using GCP
This repository is used as source code for the medium post about implementing a Twitter recommender system using GCP.
We used five main tools from Google Cloud Platform: Dataflow, Dataproc, AppEngine, BigQuery and Cloud Storage.
AppEngine (GAE)
Basically it all starts in the gae folder. There you will find the YAML definitions for both of our environments, standard and flexible.
These are the YAML files for the standard environment:
- cron.yaml: defines the cron executions and their respective times to run.
- queue.yaml: defines the rate at which queued tasks are executed.
- main.yaml: this is our main service; it receives the cron requests and builds scheduled tasks that are put into a queue. In this project, we worked with push queues.
- worker.yaml: this service is the one that executes our tasks in the background. The cron triggers main.py, which in turn calls worker.py, which is finally responsible for sending tasks to the queue.
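For orientation, a cron entry and a push-queue definition in the standard environment look roughly like this (the description, schedule, queue name and rate below are hypothetical placeholders, not the repo's actual values):

```yaml
# cron.yaml -- one hypothetical entry hitting the main service
cron:
- description: export customers and run DIMSUM daily
  url: /run_job/run_dimsum/?url=/dataproc_dimsum&target=worker&force=no
  schedule: every 24 hours

# queue.yaml -- hypothetical push-queue definition routed to the worker service
queue:
- name: default
  rate: 1/s
  max_concurrent_requests: 1
  target: worker
```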
And finally, this is our only service defined in the flexible environment, since it uses Cython to speed up the computation of recommendations:
- recommender.yaml: Makes final recommendations for customers.
We have two distinct requirements files for GAE; requirements.txt is targeted at the flexible deployment. Notice also that we have a config_template.py that should be used as a guide to create the file /gae/config.py.
Here we have the following available crons:
- Export of customers data from BigQuery to Google Cloud Storage (GCS), defined under the route /export_customers in worker.py.
- Creation of the Dataproc cluster, execution of the DIMSUM algorithm, deletion of the cluster and initialization of the Dataflow pipeline, in the route /dataproc_dimsum, also in worker.py.
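The handoff from main.py to worker.py can be pictured as turning the cron request's query string into a task payload for the push queue. This is only a sketch with a made-up helper, not the repo's actual code; on GAE the enqueueing itself would go through google.appengine.api.taskqueue:

```python
# Hypothetical sketch of how main.py might build a worker task from a
# cron request's query string (helper name and shape are illustrative).
try:
    from urllib.parse import parse_qsl   # Python 3
except ImportError:
    from urlparse import parse_qsl       # Python 2 (GAE standard runtime)

def build_task(query_string):
    """Turn e.g. 'url=/dataproc_dimsum&target=worker&force=no' into a
    payload that could then be handed to taskqueue.add(...)."""
    params = dict(parse_qsl(query_string))
    return {
        'url': params.pop('url'),        # worker route, e.g. /export_customers
        'target': params.pop('target'),  # service that should run the task
        'params': params,                # remaining args ride along
    }
```

Applied alone, build_task('url=/export_customers&target=worker') would yield the route and target for the customer-export task with an empty params dict.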
To deploy these files just run:
gcloud app deploy app.yaml worker.yaml
gcloud app deploy cron.yaml
gcloud app deploy queue.yaml
gcloud app deploy recommender.yaml
Unit Tests for GAE
Running the unit tests in this project is quite simple; just install nox by running:
pip install nox-automation
We then have general unit tests and system tests (which make real connections to BigQuery and so on).
To run regular unit tests just run:
nox -s unit_gae
And system tests:
nox -s system_gae
- Unit tests require gcloud to be installed in /gcloud-sdk/ for simulating the AppEngine server locally.
- System tests require a service key with Editor access to BigQuery and GCS.
Dataproc
Right after the cron that exports data from BQ to GCS is executed, the Dataproc one starts.
The main subfolder in /dataproc is /jobs, where we have 3 main jobs:
- naive.py: the naive implementation, with O(mL*L) complexity.
- df_naive.py: an attempt to implement the naive approach using Spark DataFrames. It failed ;)...
- dimsum.py: DIMSUM implementation in PySpark, following the work of Reza Zadeh.
There's no config file here, as the setup is received as input from the cron job and the config in the GAE folder. For instance (as in the cron file):
/run_job/run_dimsum/?url=/dataproc_dimsum&target=worker&extended_args=--days_init=30,--days_end=1,--threshold=0.1&force=no
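The comma-separated extended_args string can simply be split into the argument list handed to the Dataproc job; a sketch with a hypothetical helper (the real parsing lives in worker.py):

```python
def parse_extended_args(extended_args):
    """Hypothetical helper: split the cron's extended_args string into
    the argument list passed along to the PySpark job.

    '--days_init=30,--days_end=1,--threshold=0.1'
        -> ['--days_init=30', '--days_end=1', '--threshold=0.1']
    """
    return [arg for arg in extended_args.split(',') if arg]
```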
Important note: threshold is equivalent to the inverse of the "gamma" value discussed in Reza Zadeh's paper; it basically sets a cut-off above which everything is guaranteed to converge with a relative error bound of 20%.
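For intuition about the sampling that gamma controls, here is a pure-Python sketch of DIMSUM's map step as described in Zadeh's paper (not the repo's PySpark code): each column pair in a row is kept with probability driven by sqrt(gamma) over the column norms, and kept contributions are rescaled so the similarity estimate stays unbiased. With gamma large enough, every pair is kept and the result is exact cosine similarity.

```python
import math
import random

def dimsum_map(row, col_norms, gamma):
    """One DIMSUM map step over a single sparse row (sketch).

    row:       dict col_id -> value
    col_norms: dict col_id -> L2 norm of that column, ||c_i||
    gamma:     oversampling parameter from the paper

    A pair (i, j) is kept with probability
    min(1, sqrt(gamma)/||c_i||) * min(1, sqrt(gamma)/||c_j||), and its
    contribution is rescaled so that summing emitted values over all
    rows estimates cos(c_i, c_j) without bias.
    """
    sqrt_gamma = math.sqrt(gamma)
    out = []
    cols = sorted(row)
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            i, j = cols[a], cols[b]
            keep = (min(1.0, sqrt_gamma / col_norms[i]) *
                    min(1.0, sqrt_gamma / col_norms[j]))
            if random.random() < keep:
                scale = (min(sqrt_gamma, col_norms[i]) *
                         min(sqrt_gamma, col_norms[j]))
                out.append(((i, j), row[i] * row[j] / scale))
    return out
```

Columns with norm above sqrt(gamma) get sampled down, which is exactly how DIMSUM caps the shuffle size for popular columns.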
Unit Tests for Dataproc
Running unit tests will require a local Spark Cluster; this Docker image is recommended. Just run it like:
docker run -it --rm -p 8888:8888 -e GRANT_SUDO=yes --user root jupyter/pyspark-notebook bash
Once there, you can clone the repository, install nox and run the tests with the command:
nox -s system_dataproc
Dataflow
After the Dataproc job is completed, a task is scheduled through the URL /prepare_datastore, which expects a Dataflow template to be available at the path specified in gae/config.py.
To create this template, make sure to install apache-beam by running:
pip install apache-beam
pip install apache-beam[gcp]
After that, make sure to create the file /dataflow/config.py, using /dataflow/config_template.py as a guideline, with values for your project in GCP.
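The exact keys are defined by the repo's config_template.py; a plausible shape, with every value here a placeholder rather than a real setting:

```python
# dataflow/config.py -- placeholder sketch only; copy the real keys from
# dataflow/config_template.py and fill in your own project's values.
config = {
    'project': 'my-gcp-project',                                # hypothetical
    'staging_location': 'gs://my-bucket/staging',               # hypothetical
    'temp_location': 'gs://my-bucket/tmp',                      # hypothetical
    'template_location': 'gs://my-bucket/templates/datastore',  # hypothetical
}
```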
All this being done just run:
python build_datastore_template
and the template will be saved where the config file specifies.
Unit Tests for Dataflow
Just make sure to have nox installed and run:
nox -s system_dataflow
Having a service account is also required.