Spring2024_Data_Streaming_Platform_with_Apache_Kafka
Description: The project involves setting up Apache Kafka and a PostgreSQL database in Docker containers orchestrated with docker-compose to establish a streamlined data streaming platform. Python is used to fetch data from an external source, format it for ingestion into Kafka topics, and configure producers for efficient data transfer. Python-based Kafka consumers perform some EDA in a Jupyter notebook, then process and validate the data before storing it in PostgreSQL using a predefined schema. The goal is a reliable system that seamlessly downloads, processes, and securely stores external data in real time, using Kafka as the intermediary, Python for the logic, and Docker for deployment flexibility.
Hi Prof. @gpsaggese and @Shaunak01, instead of using a Jupyter notebook as my consumer, I want to use a plain Python script running inside a container as my consumer that reads data from the Kafka stream. The script will do some validation before ingesting the data into Postgres. Is this okay?
Yes, seems okay to me.
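For reference, a minimal sketch of such a containerized consumer, assuming the `confluent-kafka` and `psycopg2` packages, a hypothetical topic `external_data`, a hypothetical table `readings`, and placeholder connection settings and validation rules (not the project's actual configuration):

```python
import json

import psycopg2
from confluent_kafka import Consumer

# Hypothetical broker/DB addresses; the real project defines these in docker-compose.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "postgres-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["external_data"])

conn = psycopg2.connect(host="postgres", dbname="streaming", user="postgres", password="postgres")

def is_valid(record: dict) -> bool:
    # Placeholder validation: require an id and a non-negative numeric value.
    return "id" in record and isinstance(record.get("value"), (int, float)) and record["value"] >= 0

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        if not is_valid(record):
            continue  # Skip records that fail validation.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO readings (id, value) VALUES (%s, %s)",
                (record["id"], record["value"]),
            )
        conn.commit()
finally:
    consumer.close()
    conn.close()
```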
Here is the progress I have made so far:
- Create a docker-compose file that defines all the necessary containers for the project (Zookeeper, Kafka broker, Jupyter, Schema Registry, Postgres)
- Create a Kafka tutorial in a Jupyter notebook
- Create a producer and a consumer that continuously produce and consume a stream of messages (a minimal sketch appears after this list)
- Write a draft of the project documentation that includes the following sections:
  - Overview
  - Technologies Used
  - Project Structure
  - Docker Implementation
  - How to Run
  - Kafka Tutorial
  - Implementing Data Streaming Platform
  - Cleaning Up
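As an illustration of the producer side, here is a minimal sketch assuming the `confluent-kafka` package, the hypothetical topic `external_data`, and a placeholder data source; it is not the project's exact producer:

```python
import json
import time

from confluent_kafka import Producer

# Hypothetical broker address; in the project it comes from docker-compose.
producer = Producer({"bootstrap.servers": "kafka:9092"})

def fetch_record(i: int) -> dict:
    # Placeholder for fetching data from the external source.
    return {"id": i, "value": i * 0.5}

def on_delivery(err, msg):
    # Report delivery failures so bad sends are not silently dropped.
    if err is not None:
        print(f"Delivery failed: {err}")

i = 0
while True:
    record = fetch_record(i)
    producer.produce("external_data", value=json.dumps(record), callback=on_delivery)
    producer.poll(0)   # Serve delivery callbacks.
    producer.flush()   # Flushing every message keeps the sketch simple; batch in practice.
    i += 1
    time.sleep(1)
```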
TODO:
- Add an example of schema validation for the producer and consumer using Apache Avro for data serialization
- Add an example of data validation logic in our consumer
- Improve the README by adding explanations of concepts such as Kafka Cluster, Topics, Messages, Partitioning, Replication Factor, Producer, Consumer, Consumer Group, Schema Registry
- Add an explanation to the Implementing Data Streaming Platform section
- Add comments to the code
Additional Completed Items:
- Add an example of schema validation for the producer and consumer using Apache Avro for data serialization (a minimal sketch appears after this list)
- Add an example of data validation logic in our consumer
- Improve the README by adding explanations of concepts such as Kafka Cluster, Topics, Messages, Partitioning, Replication Factor, Producer, Consumer, Consumer Group, Schema Registry
- Add an explanation to the Implementing Data Streaming Platform section
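For illustration, a minimal sketch of Avro serialization against the Schema Registry, assuming the `confluent-kafka` package with its Avro support; the schema, the topic `external_data`, and the registry URL are hypothetical placeholders rather than the project's actual configuration:

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

# Hypothetical Avro schema for the records being streamed.
schema_str = """
{
  "type": "record",
  "name": "Reading",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "value", "type": "double"}
  ]
}
"""

# Schema Registry address as exposed by docker-compose (placeholder).
schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({"bootstrap.servers": "kafka:9092"})
record = {"id": 1, "value": 42.0}

# Serialization fails if the record does not conform to the registered schema.
payload = avro_serializer(record, SerializationContext("external_data", MessageField.VALUE))
producer.produce("external_data", value=payload)
producer.flush()
```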
TODO:
- Add an example of data validation logic in our consumer
- Add comments to the code
Additional Completed Items:
- Add an example of data validation logic in our consumer
- Add comments to the code
TODO:
- Record a short video explaining the project and attach it to the PR
Additional Completed Items:
- Record a short video walkthrough of the project.
The project is now complete.