pyflink-nlp
Self-contained demo using PyFlink with Gensim+spaCy to find topics in the Flink User Mailing List. All you need is Docker! 🐳
Building an Analytics Pipeline with PyFlink
:warning: Update: This repository will no longer be actively maintained. Please check the Ververica fork.
See the slides for more context.
Docker
To keep things simple, this demo uses a Docker Compose setup that bundles all the services you need.
Getting the setup up and running
docker-compose build
docker-compose up -d
Is everything really up and running?
docker-compose ps
You should be able to access the Flink Web UI (http://localhost:8081), as well as Superset (http://localhost:8088).
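If you would rather check from a script than from the browser, the sketch below probes both URLs mentioned above using only the Python standard library. The `is_up` helper and the `SERVICES` mapping are hypothetical names introduced for illustration; they are not part of this repository.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the text above; the dictionary itself is illustrative.
SERVICES = {
    "Flink Web UI": "http://localhost:8081",
    "Superset": "http://localhost:8088",
}


def is_up(url, timeout=2.0):
    """Return True if anything answers HTTP at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The service answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed response, etc.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "up" if is_up(url) else "not reachable"
        print(f"{name}: {status} ({url})")
```

A service that is still starting may report "not reachable" for a short while after `docker-compose up -d`, so it can be worth retrying after a few seconds.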
Analyzing the Flink User Mailing List
What do people ask about most frequently on the Flink User Mailing List? How can you make sense of such a huge amount of unstructured text?
Some Background
The model in this demo was trained with LDA, a popular topic modeling algorithm, using Gensim, a Python library with a good implementation of it. The trained model knows, to some extent, which combinations of words are associated with which topics, and can simply be passed as a dependency to PyFlink.
Don't trust the model. :japanese_ogre:
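To make the idea of "words associated with topics" concrete, here is a toy sketch in plain Python. It is not LDA inference and does not call Gensim: the `TOPIC_WORDS` table is a hand-written, hypothetical stand-in for what a trained model would learn, and the topic names and weights are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical topic -> word weights, standing in for a trained model.
TOPIC_WORDS = {
    "checkpointing": {"checkpoint": 0.5, "savepoint": 0.3, "state": 0.2},
    "connectors": {"kafka": 0.5, "connector": 0.3, "sink": 0.2},
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def score_topics(text, topic_words=TOPIC_WORDS):
    """Return normalized topic scores, or {} if no known word occurs."""
    counts = Counter(tokenize(text))
    scores = {
        topic: sum(weight * counts[word] for word, weight in words.items())
        for topic, words in topic_words.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {}
    return {topic: score / total for topic, score in scores.items()}


def best_topic(text):
    """Return the highest-scoring topic name, or None if nothing matched."""
    scores = score_topics(text)
    return max(scores, key=scores.get) if scores else None


print(best_topic("My Kafka sink loses records after a restart"))
```

A real LDA model infers a probability distribution over topics for each document rather than summing fixed weights, which is one more reason to treat its output with caution.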
Submitting the PyFlink job
docker-compose exec jobmanager ./bin/flink run -py /opt/pyflink-nlp/pipeline.py -d
Once you see the "Job has been submitted with JobID <JobId>" message, you can check and monitor the job's execution in the Flink Web UI.
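The job state can also be read programmatically from Flink's REST API, which is served on the same port as the Web UI. The sketch below assumes the `/jobs/overview` endpoint and its `jid`, `name`, and `state` fields; the helper names and the sample payload (including the job name) are illustrative, not output captured from this demo.

```python
import json
import urllib.request

# Same address as the Flink Web UI mentioned above.
FLINK_REST = "http://localhost:8081"


def fetch_job_overview(base_url=FLINK_REST, timeout=5.0):
    """Fetch the job overview as a dict (requires the cluster to be running)."""
    with urllib.request.urlopen(f"{base_url}/jobs/overview", timeout=timeout) as resp:
        return json.load(resp)


def summarize_jobs(payload):
    """Reduce an overview payload to (job id, name, state) tuples."""
    return [
        (job["jid"], job["name"], job["state"])
        for job in payload.get("jobs", [])
    ]


# Illustrative payload with made-up values, so the sketch runs offline.
sample = {"jobs": [{"jid": "abc123", "name": "example-job", "state": "RUNNING"}]}

for jid, name, state in summarize_jobs(sample):
    print(f"{jid}  {name}  {state}")
```

With the containers up, `summarize_jobs(fetch_job_overview())` would list the submitted job along with its current state.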

Visualizing on Superset
To visualize the results, navigate to http://localhost:8088 and log into Superset using:
username: admin
password: superset
There should be a default dashboard named "Flink User Mailing List" listed under Dashboards.

And that's it!
For the latest updates on PyFlink, follow Apache Flink on Twitter.