example_dataproc_twitter
Implementation of a Twitter Based Recommender System Fully Serverless Using GCP
This repository is used as source code for the medium post about implementing a Twitter recommender system using GCP.
We used five main tools from Google Cloud Platform: Dataflow, Dataproc, AppEngine, BigQuery and Cloud Storage.
AppEngine (GAE)
Basically it all starts in the gae folder. There you will find the YAML definitions for both of our environments, standard and flexible.
These are the YAML files for the standard environment:
- cron.yaml: defines the cron executions and their respective times to run.
- queue.yaml: defines the rate at which queued tasks are executed.
- main.yaml: this is our main service; it receives the cron requests and builds scheduled tasks that are put into a queue. In this project, we worked with push queues.
- worker.yaml: this service is the one that executes our tasks in the background. The cron triggers main.py, which in turn calls worker.py, which is finally responsible for sending tasks to the queue.
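For orientation, a cron entry and a push-queue definition in the standard environment look roughly like this (the description, schedule, queue name and rate below are hypothetical placeholders, not the repo's actual values):

```yaml
# cron.yaml -- one hypothetical entry hitting the main service
cron:
- description: export customers and run DIMSUM daily
  url: /run_job/run_dimsum/?url=/dataproc_dimsum&target=worker&force=no
  schedule: every 24 hours

# queue.yaml -- hypothetical push-queue definition routed to the worker service
queue:
- name: default
  rate: 1/s
  max_concurrent_requests: 1
  target: worker
```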
And finally, this is our only service defined in the flexible environment, since it uses Cython to speed up the computation of recommendations:
- recommender.yaml: Makes final recommendations for customers.
We have two distinct requirements files for GAE; requirements.txt is targeted at the flexible deployment. Notice also that we have a config_template.py that should be used as a guide to create the file /gae/config.py.
Here we have the following available crons:
- Export of customers data from BigQuery to Google Cloud Storage (GCS), defined under the route /export_customers in worker.py.
- Creation of the Dataproc cluster, execution of the DIMSUM algorithm, deletion of the cluster and initialization of the Dataflow pipeline, in the route /dataproc_dimsum, also in worker.py.
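The handoff from main.py to worker.py can be pictured as turning the cron request's query string into a task payload for the push queue. This is only a sketch with a made-up helper, not the repo's actual code; on GAE the enqueueing itself would go through google.appengine.api.taskqueue:

```python
# Hypothetical sketch of how main.py might build a worker task from a
# cron request's query string (helper name and shape are illustrative).
try:
    from urllib.parse import parse_qsl   # Python 3
except ImportError:
    from urlparse import parse_qsl       # Python 2 (GAE standard runtime)

def build_task(query_string):
    """Turn e.g. 'url=/dataproc_dimsum&target=worker&force=no' into a
    payload that could then be handed to taskqueue.add(...)."""
    params = dict(parse_qsl(query_string))
    return {
        'url': params.pop('url'),        # worker route, e.g. /export_customers
        'target': params.pop('target'),  # service that should run the task
        'params': params,                # remaining args ride along
    }
```

Applied alone, build_task('url=/export_customers&target=worker') would yield the route and target for the customer-export task with an empty params dict.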
To deploy these files just run:
gcloud app deploy app.yaml worker.yaml
gcloud app deploy cron.yaml
gcloud app deploy queue.yaml
gcloud app deploy recommender.yaml
Unit Tests for GAE
Running the unit tests in this project is quite simple; just install nox by running:
pip install nox-automation
We then have general unit tests and system tests (which make real connections to BigQuery and so on).
To run regular unit tests just run:
nox -s unit_gae
And system tests:
nox -s system_gae
- Unit tests require gcloud to be installed in /gcloud-sdk/ for simulating the AppEngine server locally.
- System tests require a service key with Editor access to BigQuery and GCS.
Dataproc
Right after the cron that exports data from BQ to GCS is executed, the Dataproc one starts.
The main subfolder in /dataproc is /jobs, where we have 3 main jobs:
- naive.py: the naive implementation, with O(mL*L) complexity.
- df_naive.py: an attempt to implement the naive approach using Spark DataFrames. It failed ;)...
- dimsum.py: DIMSUM implementation in PySpark, following the work of Reza Zadeh.
There's no config file here, as the setup is received as input from the cron job and the config in the GAE folder. For instance (as in the cron file):
/run_job/run_dimsum/?url=/dataproc_dimsum&target=worker&extended_args=--days_init=30,--days_end=1,--threshold=0.1&force=no
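The comma-separated extended_args string can simply be split into the argument list handed to the Dataproc job; a sketch with a hypothetical helper (the real parsing lives in worker.py):

```python
def parse_extended_args(extended_args):
    """Hypothetical helper: split the cron's extended_args string into
    the argument list passed along to the PySpark job.

    '--days_init=30,--days_end=1,--threshold=0.1'
        -> ['--days_init=30', '--days_end=1', '--threshold=0.1']
    """
    return [arg for arg in extended_args.split(',') if arg]
```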
Important note: threshold is equivalent to the inverse of the "gamma" value discussed in Reza Zadeh's paper; it basically sets a cut-off above which everything is guaranteed to converge with a relative error bound of 20%.
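For intuition about the sampling that gamma controls, here is a pure-Python sketch of DIMSUM's map step as described in Zadeh's paper (not the repo's PySpark code): each column pair in a row is kept with probability driven by sqrt(gamma) over the column norms, and kept contributions are rescaled so the similarity estimate stays unbiased. With gamma large enough, every pair is kept and the result is exact cosine similarity.

```python
import math
import random

def dimsum_map(row, col_norms, gamma):
    """One DIMSUM map step over a single sparse row (sketch).

    row:       dict col_id -> value
    col_norms: dict col_id -> L2 norm of that column, ||c_i||
    gamma:     oversampling parameter from the paper

    A pair (i, j) is kept with probability
    min(1, sqrt(gamma)/||c_i||) * min(1, sqrt(gamma)/||c_j||), and its
    contribution is rescaled so that summing emitted values over all
    rows estimates cos(c_i, c_j) without bias.
    """
    sqrt_gamma = math.sqrt(gamma)
    out = []
    cols = sorted(row)
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            i, j = cols[a], cols[b]
            keep = (min(1.0, sqrt_gamma / col_norms[i]) *
                    min(1.0, sqrt_gamma / col_norms[j]))
            if random.random() < keep:
                scale = (min(sqrt_gamma, col_norms[i]) *
                         min(sqrt_gamma, col_norms[j]))
                out.append(((i, j), row[i] * row[j] / scale))
    return out
```

Columns with norm above sqrt(gamma) get sampled down, which is exactly how DIMSUM caps the shuffle size for popular columns.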
Unit Tests for Dataproc
Running unit tests will require a local Spark Cluster; this Docker image is recommended. Just run it like:
docker run -it --rm -p 8888:8888 -e GRANT_SUDO=yes --user root jupyter/pyspark-notebook bash
Once there, you can clone the repository, install nox and run the tests with the command:
nox -s system_dataproc
Dataflow
After the Dataproc job is completed, a task is scheduled through the URL /prepare_datastore, which expects a Dataflow template to be available at the path specified in gae/config.py.
To create this template, make sure to install apache-beam by running:
pip install apache-beam
pip install apache-beam[gcp]
After that, make sure to create the file /dataflow/config.py, using /dataflow/config_template.py as a guideline, with values for your project in GCP.
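The exact keys are defined by the repo's config_template.py; a plausible shape, with every value here a placeholder rather than a real setting:

```python
# dataflow/config.py -- placeholder sketch only; copy the real keys from
# dataflow/config_template.py and fill in your own project's values.
config = {
    'project': 'my-gcp-project',                                # hypothetical
    'staging_location': 'gs://my-bucket/staging',               # hypothetical
    'temp_location': 'gs://my-bucket/tmp',                      # hypothetical
    'template_location': 'gs://my-bucket/templates/datastore',  # hypothetical
}
```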
All this being done just run:
python build_datastore_template
and the template will be saved where the config file specifies.
Unit Tests for Dataflow
Just make sure to have nox installed and run:
nox -s system_dataflow
Having a service account is also required.