kafka_spark_structured_streaming
Issue while running the DAG
I am following the kafka_spark_structured_streaming repo and trying to play around with the details. However, I am getting a "Kafka not found" error. I have checked the Docker image of Apache Airflow and I can see that kafka is part of the requirements.txt file, so I am not sure why I am seeing this error. Please advise in case I am missing anything on my end. Broken DAG: [/usr/local/airflow/dags/stream_to_kafka_dag.py] No module named 'kafka' @dogukannulu Please advise.
@dogukannulu I am also having the same issue. Please advise on how to fix it. Thanks.
Hey @Abhi3linku and @elaiken3 ,
I encountered a similar issue with the requirements. I don't know if there is a fix without changing the current structure, but I can propose an alternative approach.
You can initialize Airflow with an entrypoint, a .sh script given to the Airflow image. Please take a look at my code here: docker-compose-infra.
I created my own basic Airflow Docker image and started it with an entrypoint, which looks like this: entrypoint.sh. This initializes Airflow with the requirements you need for the DAG.
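For reference, here is a minimal sketch of what such an entrypoint could look like. It is not the exact script from my repo, and the requirements path is an assumption, so adjust it to wherever your requirements.txt is mounted:

    #!/usr/bin/env bash
    set -e

    # Install the Python packages the DAGs need.
    # The path below is an assumption; point it at your mounted requirements.txt.
    pip install --no-cache-dir -r /requirements.txt

    # Hand control back to the original Airflow command (e.g. webserver).
    exec "$@"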
Sorry again, @dogukannulu, for promoting my work on your repository. The project's credit is all yours :)
Hope you find it helpful.
Broken DAG: [/usr/local/airflow/dags/stream_to_kafka_dag.py] No module named 'kafka_spark_structured_streaming'
which tells me I should have one DAG in one folder and the other in docker-airflow. Is that true?
Tried continuing and hit a snag at the spark-submit step:
Traceback (most recent call last):
File "/opt/bitnami/spark/spark_streaming.py", line 98, in
Hello @NTC4818, if you encounter any issues while running containers defined across multiple YAML files, please use the approach I used here: https://github.com/dogukannulu/crypto_api_kafka_airflow_streaming/blob/main/docker-compose.yaml This will help you create all tool containers from a single file.
Regarding the issue you are seeing, please delete the container and build it from scratch, because if you run it without specifying a name, it will be created with the folder name. A quick sketch of that teardown is below.
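For example, something like this (the project name here is a placeholder; pick your own so the containers are not named after the folder):

    docker-compose -p kafka_spark_streaming down
    docker-compose -p kafka_spark_streaming up -d --build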
I think what is happening is that they are not in the right structure, maybe?
I just followed the Docker commands as mentioned in the post. I was able to move docker-compose-LocalExecutor into docker-airflow. That worked; in docker-airflow, I ran docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow . and then docker-compose -f docker-compose-LocalExecutor.yml up -d
Then I cd'd out and ran docker-compose up -d at your repo level.
Then I moved both files specified into the dags folder inside docker-airflow, as that is the host of the airflow container, but stream_to_kafka_dag contains
from kafka_spark_structured_streaming.dags.stream_to_kafka import start_streaming
meaning this file should be in the dags folder outside of docker-airflow. Right?
Sorry if this is annoying, but I am a data engineer and really want to get better at building pipelines. It's so hard to find deliberate practice like this, so thank you. And I am enjoying that it doesn't run perfectly; I am just a little stuck.
If it ran perfectly, I would've learned WAY less.
I have two suggestions here:
- You may keep your scripts (e.g. stream_to_kafka) under the dags folder and import each method directly from your script, for example from stream_to_kafka import start_streaming (see the sketch after this list).
- The other option: since this app works inside a container environment, you can use a Dockerfile where you copy your local files into a directory inside your container and create a volume out of them, just like I did here: https://github.com/dogukannulu/crypto_api_kafka_airflow_streaming/blob/main/docker-compose.yaml#L33 The second option is not recommended, though, since we are trying to build a production-like app here.
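For the first option, here is a minimal sketch of what the DAG file could look like once stream_to_kafka.py sits next to it in the dags folder. The operator, schedule, and dag_id here are assumptions rather than the repo's exact code, and note that on the older puckel image the operator import is airflow.operators.python_operator instead:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Flat import works because stream_to_kafka.py lives in the same dags folder.
    from stream_to_kafka import start_streaming

    with DAG(
        dag_id="stream_to_kafka_dag",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="kafka_data_stream",
            python_callable=start_streaming,
        )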
Yes, it makes good sense to change the import. Maybe that will get the DAG set up correctly; I will tear down and rebuild. I appreciate the input and will surely send another message shortly!
Even after attempting to install kafka on the Docker container running Airflow, I get this error: Broken DAG: [/usr/local/airflow/dags/stream_to_kafka_dag.py] No module named 'kafka'
Hello @NTC4818, to resolve the issue, I propose a different approach:
- Create a folder named 'script' in the root of your project.
- In the script folder, create an 'entrypoint.sh' file and paste what's in here
This shell script will act as an entry point to your webserver and download the necessary packages you place in your requirements.txt file
- Reconfigure your airflow image like the web-server image
Mind the command & entrypoint variables of the image. Make sure the relevant variables (such as ports, volumes, environments) are configured correctly for your project.
- Then delete all Docker images and re-create them from scratch, as sketched below.
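A minimal sketch of that teardown and rebuild using standard docker-compose flags (file and service names are whatever your project uses):

    # Stop everything and remove the containers, images, and volumes they used.
    docker-compose down --rmi all --volumes

    # Rebuild the images and start the stack again.
    docker-compose up -d --build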
Please let me know if that fixes your problem.