
Issue while running the DAG

Open abhishekpattanaik1 opened this issue 1 year ago • 10 comments

I am following the kafka_spark_structured_streaming repo and trying to play around with the details. However, I am getting a Kafka-not-found error. I have checked the Docker image of Apache Airflow and I can see that kafka is part of the requirements.txt file, so I am not sure why I am seeing this error. Please advise in case I am missing anything on my end.

Broken DAG: [/usr/local/airflow/dags/stream_to_kafka_dag.py] No module named 'kafka'

@dogukannulu Please advise.
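For reference, this is roughly how one can check whether the package actually made it into the running Airflow container (the container name below is only a placeholder):

    # List the running containers and note the Airflow webserver's name.
    docker ps

    # Check inside that container whether the 'kafka' module is importable
    # ('airflow_webserver' is a placeholder for the real container name).
    docker exec -it airflow_webserver python -c "import kafka; print(kafka.__version__)"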

abhishekpattanaik1 avatar Jun 06 '24 05:06 abhishekpattanaik1

@dogukannulu I am also having this same issue. Please advise on how to fix. Thanks.

elaiken3 avatar Jun 06 '24 05:06 elaiken3

Hey @Abhi3linku and @elaiken3 ,

I encountered a similar issue with the requirements. I don't know if there is a fix without changing the current structure, but I can propose an alternative approach.

You can initialize Airflow with an entrypoint, a .sh script handed to the Airflow image. Please take a look at my code here: docker-compose-infra.

I created my own basic Airflow Docker image and started it with an entrypoint, which looks like this: entrypoint.sh. This initializes Airflow with the requirements you need for the DAG.
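The gist is just to install the requirements before handing control to Airflow; roughly something like this (the requirements path and the exact Airflow commands here are illustrative, the actual entrypoint.sh linked above is the reference):

    #!/bin/bash
    set -e

    # Install the Python packages the DAGs need before Airflow starts
    # (the path to requirements.txt is an assumption for this sketch).
    pip install --no-cache-dir -r /opt/airflow/requirements.txt

    # Initialize the metadata database (older 1.10.x images use 'airflow initdb'),
    # then hand control to the webserver.
    airflow db init
    exec airflow webserver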

Sorry again, @dogukannulu, that I promoted my work on your repository. The project's credit is all yours :)

Hope you find it helpful.

hkaanengin avatar Jun 12 '24 20:06 hkaanengin

Broken DAG: [/usr/local/airflow/dags/stream_to_kafka_dag.py] No module named 'kafka_spark_structured_streaming'

which tells me I should have one DAG in one folder and the other in docker-airflow. Is that true?

Tried continuing and hit a snag at the spark-submit step:

    Traceback (most recent call last):
      File "/opt/bitnami/spark/spark_streaming.py", line 98, in <module>
        write_streaming_data()
      File "/opt/bitnami/spark/spark_streaming.py", line 92, in write_streaming_data
        df = create_initial_dataframe(spark)
      File "/opt/bitnami/spark/spark_streaming.py", line 52, in create_initial_dataframe
        return df
    UnboundLocalError: local variable 'df' referenced before assignment
    I have no name!@93945738dee6:/opt/bitnami/spark$

NTC4818 avatar Sep 24 '24 15:09 NTC4818

Hello @NTC4818, if you encounter any issues while running containers within multiple yaml files, please use the approach I used here: https://github.com/dogukannulu/crypto_api_kafka_airflow_streaming/blob/main/docker-compose.yaml This will help you create all tool containers in a single file.

Regarding the issue you are seeing, please delete the container and build it again from scratch, because if you run it without specifying a name, it will be created with the folder name.
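Roughly, that rebuild looks like this (adjust the compose file names to the ones you are actually using):

    # Stop and remove the existing containers (and their volumes).
    docker-compose down --volumes

    # Rebuild the images from scratch so no stale layers are reused.
    docker-compose build --no-cache

    # Bring everything back up in detached mode.
    docker-compose up -d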

dogukannulu avatar Sep 24 '24 15:09 dogukannulu

I think what is happening is that they are not in the right structure, maybe?

I just followed the Docker commands as mentioned in the post. I was able to move docker-compose-LocalExecutor into docker-airflow, and that worked. In docker-airflow, I ran:

    docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .
    docker-compose -f docker-compose-LocalExecutor.yml up -d

Then I cd'd out and ran docker-compose up -d at your repo's root level.

Then I moved both of the specified files into the dags folder inside docker-airflow, since that is the host for the Airflow container, but stream_to_kafka_dag contains

from kafka_spark_structured_streaming.dags.stream_to_kafka import start_streaming

meaning this file should be in the dags folder outside of docker-airflow, right?

NTC4818 avatar Sep 24 '24 16:09 NTC4818

Sorry, this may be annoying, but I am a data engineer and really want to get better at building pipelines. It's so hard to find deliberate practice like this, so thank you. And I am enjoying that it doesn't run perfectly; I am just a little stuck.

If it ran perfectly, I would've learned WAY less.

NTC4818 avatar Sep 24 '24 16:09 NTC4818

I have two suggestions here:

  • You may keep your scripts (e.g. stream_to_kafka) under the dags folder and import each method directly from your script, for example: from stream_to_kafka import start_streaming (see the sketch after this list).
  • The other option: since this app works inside a container environment, you can use a Dockerfile where you copy your local files into a directory in your container and create a volume out of them, just like I did here: https://github.com/dogukannulu/crypto_api_kafka_airflow_streaming/blob/main/docker-compose.yaml#L33. This second option is not recommended, though, since we are trying to build a production-like app here.
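A rough sketch of the first option (the paths are assumptions based on the layout described earlier in this thread):

    # Put the script right next to the DAG file so a plain
    # 'from stream_to_kafka import start_streaming' resolves inside the Airflow container.
    cp kafka_spark_structured_streaming/dags/stream_to_kafka.py docker-airflow/dags/
    cp kafka_spark_structured_streaming/dags/stream_to_kafka_dag.py docker-airflow/dags/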

dogukannulu avatar Sep 24 '24 17:09 dogukannulu

Yes, it makes good sense to change the import. Maybe that will get the DAG set up correctly. I will tear down and rebuild. I appreciate the input and will surely send another message shortly!

NTC4818 avatar Sep 24 '24 18:09 NTC4818

Even after attempting to install it on the Docker container running Airflow, I get this error:

Broken DAG: [/usr/local/airflow/dags/stream_to_kafka_dag.py] No module named 'kafka'
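For reference, a manual install into the running container would look roughly like this (the container name is a placeholder, and such an install is lost whenever the container is recreated):

    # Find the Airflow container's name.
    docker ps

    # Install the missing package inside it ('airflow_webserver' is a placeholder name).
    docker exec -it airflow_webserver pip install kafka-python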

NTC4818 avatar Sep 25 '24 01:09 NTC4818

Hello @NTC4818, to resolve the issue, I propose a different approach:

  • Create a folder named 'script' in the root of your project.
  • In the script folder, create an 'entrypoint.sh' file and paste what's in here

This shell script will act as an entry point to your webserver and install the packages you place in your requirements.txt file.

Mind the command & entrypoint variables of the image, and make sure the variables (such as ports, volumes, environment) are configured correctly for your project.

  • Then delete all Docker images and re-create them.

Please let me know if that fixes your problem.

hkaanengin avatar Sep 28 '24 08:09 hkaanengin