Spring2024_Data_Streaming_Platform_with_Apache_Kafka
Description: The project involves setting up Apache Kafka and a PostgreSQL database in Docker containers orchestrated with docker-compose to establish a streamlined data streaming platform. Python is used to fetch data from an external source, format it for ingestion into Kafka topics, and configure producers for efficient data transfer. Python-based Kafka consumers perform some EDA in a Jupyter notebook, then process and validate the data before storing it in PostgreSQL using a predefined schema. The goal is a reliable system that seamlessly downloads, processes, and securely stores external data in real time, using Kafka as the intermediary, Python for the logic, and Docker for deployment flexibility.
Hi Prof. @gpsaggese and @Shaunak01, instead of using a Jupyter notebook as my consumer, I want to use a plain Python script running inside a container as my consumer that reads data from the Kafka stream. The script will do some validation before ingesting the data into Postgres. Is this okay?
Yes, seems okay to me.
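For reference, a minimal sketch of such a containerized consumer, assuming the `confluent-kafka` and `psycopg2` packages, a hypothetical topic `external_data`, a hypothetical table `readings`, and placeholder connection settings and validation rules (not the project's actual configuration):

```python
import json

import psycopg2
from confluent_kafka import Consumer

# Hypothetical broker/DB addresses; the real project defines these in docker-compose.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "postgres-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["external_data"])

conn = psycopg2.connect(host="postgres", dbname="streaming", user="postgres", password="postgres")

def is_valid(record: dict) -> bool:
    # Placeholder validation: require an id and a non-negative numeric value.
    return "id" in record and isinstance(record.get("value"), (int, float)) and record["value"] >= 0

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        if not is_valid(record):
            continue  # Skip records that fail validation.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO readings (id, value) VALUES (%s, %s)",
                (record["id"], record["value"]),
            )
        conn.commit()
finally:
    consumer.close()
    conn.close()
```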
Here is the progress I have made so far:
- Create a docker-compose file that defines all the necessary containers for the project (Zookeeper, Kafka broker, Jupyter, Schema Registry, Postgres)
- Create a Kafka tutorial in a Jupyter notebook
- Create a producer and a consumer that continuously produce and consume a stream of messages (a minimal sketch appears after this list)
- Write a draft of the project documentation that includes the following sections:
  - Overview
  - Technologies Used
  - Project Structure
  - Docker Implementation
  - How to Run
  - Kafka Tutorial
  - Implementing Data Streaming Platform
  - Cleaning Up
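As an illustration of the producer side, here is a minimal sketch assuming the `confluent-kafka` package, the hypothetical topic `external_data`, and a placeholder data source; it is not the project's exact producer:

```python
import json
import time

from confluent_kafka import Producer

# Hypothetical broker address; in the project it comes from docker-compose.
producer = Producer({"bootstrap.servers": "kafka:9092"})

def fetch_record(i: int) -> dict:
    # Placeholder for fetching data from the external source.
    return {"id": i, "value": i * 0.5}

def on_delivery(err, msg):
    # Report delivery failures so bad sends are not silently dropped.
    if err is not None:
        print(f"Delivery failed: {err}")

i = 0
while True:
    record = fetch_record(i)
    producer.produce("external_data", value=json.dumps(record), callback=on_delivery)
    producer.poll(0)   # Serve delivery callbacks.
    producer.flush()   # Flushing every message keeps the sketch simple; batch in practice.
    i += 1
    time.sleep(1)
```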
TODO:
- Add an example of schema validation for the producer and consumer using Apache Avro for data serialization
- Add an example of data validation logic in our consumer
- Improve the README by adding explanations of concepts such as Kafka Cluster, Topics, Messages, Partitioning, Replication Factor, Producer, Consumer, Consumer Group, Schema Registry
- Add an explanation to the Implementing Data Streaming Platform section
- Add comments to the code
Additional Completed Items:
- Add an example of schema validation for the producer and consumer using Apache Avro for data serialization (a minimal sketch appears after this list)
- Add an example of data validation logic in our consumer
- Improve the README by adding explanations of concepts such as Kafka Cluster, Topics, Messages, Partitioning, Replication Factor, Producer, Consumer, Consumer Group, Schema Registry
- Add an explanation to the Implementing Data Streaming Platform section
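For illustration, a minimal sketch of Avro serialization against the Schema Registry, assuming the `confluent-kafka` package with its Avro support; the schema, the topic `external_data`, and the registry URL are hypothetical placeholders rather than the project's actual configuration:

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

# Hypothetical Avro schema for the records being streamed.
schema_str = """
{
  "type": "record",
  "name": "Reading",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "value", "type": "double"}
  ]
}
"""

# Schema Registry address as exposed by docker-compose (placeholder).
schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({"bootstrap.servers": "kafka:9092"})
record = {"id": 1, "value": 42.0}

# Serialization fails if the record does not conform to the registered schema.
payload = avro_serializer(record, SerializationContext("external_data", MessageField.VALUE))
producer.produce("external_data", value=payload)
producer.flush()
```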
TODO:
- Add an example of data validation logic in our consumer
- Add comments to the code
Additional Completed Items:
- Add an example of data validation logic in our consumer
- Add comments to the code
TODO:
- Record a short video explaining the project and attach it to the PR
Additional Completed Items:
- Record a short video walkthrough of the project.
The project is now complete.