
Error while importing sparkdl in Google Colab

Open · jai-dewani opened this issue 4 years ago · 3 comments

Here is the traceback from importing sparkdl:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-4a9be7b8a3d0> in <module>()
----> 1 import sparkdl

1 frames
/usr/local/lib/python3.6/dist-packages/sparkdl/image/imageIO.py in <module>()
     23 
     24 # pyspark
---> 25 from pyspark import Row
     26 from pyspark import SparkContext
     27 from pyspark.sql.types import (BinaryType, IntegerType, StringType, StructField, StructType)

ModuleNotFoundError: No module named 'pyspark'

sparkdl version -> 0.2.2
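
For reference, a quick way to confirm which of the two packages is actually present in the Colab runtime (a small diagnostic sketch, not part of the original report):

!pip show sparkdl   # should report version 0.2.2, per the line above
!pip show pyspark   # warns "Package(s) not found" if pyspark is missing, matching the traceback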

jai-dewani · Apr 30 '20 16:04

Hey @jai-dewani, this is expected behavior. Google Colab's default environment doesn't ship with Spark or its Python bindings (pyspark), hence the ModuleNotFoundError. You'll need to install those dependencies first.

This repo (https://github.com/asifahmed90/pyspark-ML-in-Colab) has an example of that, but it's a bit dated, so you might ask @asifahmed90 if you run into any issues. Good luck!
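
As a quicker alternative to downloading a full Spark distribution, installing PySpark from PyPI is often enough to clear the ModuleNotFoundError in Colab (a minimal sketch, assuming a recent Colab runtime; the old sparkdl 0.2.2 release may still require a matching Spark/pyspark version):

!pip install -q pyspark

import pyspark
print(pyspark.__version__)   # confirm pyspark is importable before retrying 'import sparkdl'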

scook12 · May 07 '20 20:05

Actually, I did all the necessary steps from the start, yet I still end up with this problem.
Here is the link to my Colab notebook: https://colab.research.google.com/drive/1nYq-rv6MT78UaiQPcSaFT-PHpsgVBe7R?usp=sharing

While running the notebook, just run the first two subsections and you will end up with the same result. I have been looking hard for any minor mistake or something I missed, but I can't seem to find anything :/

Edit: A similar issue with the same problem has been posted: #209 AttributeError: module 'sparkdl' has no attribute 'graph'

jai-dewani · May 08 '20 17:05

@jai-dewani, this setup worked for me.

# Install Java 8 (required by Spark) and download a prebuilt Spark distribution
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

# Point the environment at the Java and Spark installations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

# Let findspark add pyspark to sys.path, then start a local Spark session
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

I arrived at that solution by looking at the latest Spark packages on the distribution page. You can do the same by checking https://downloads.apache.org/spark/ and picking the latest Spark and Hadoop versions, e.g. spark-X.X.X/spark-X.X.X-bin-hadoopX.X.tgz.

Change these filenames in the above code as required.
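
As a quick sanity check after the setup above (a hedged example; sparkdl itself still has to be pip-installed separately before its import will succeed):

# Verify that the local Spark session works
print(spark.version)
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()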

sainikhileshreddy · May 14 '21 07:05