spark-on-lambda
Python example file's data location does not meet Lambda's expectation
I am using the Python example python/ml/kmeans_example.py. This file has a hard-coded relative path 'data/mllib/sample_kmeans_data.txt'.
Now when I run ./bin/spark-submit --master lambda://test examples/src/main/python/ml/kmeans_example.py from the driver folder, Spark's log shows:

java.io.FileNotFoundException: File file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt does not exist
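The error path makes sense: the example passes a relative path, which is resolved against the driver's working directory into a file:/ URL. A minimal sketch of that resolution (the helper name and cwd default are mine, for illustration only):

```python
import os

def resolve_relative(data_path, cwd="/home/ec2-user/driver"):
    # Relative data paths get resolved against the driver's current
    # working directory and turned into a file:/ URL, which is exactly
    # the path shown in the FileNotFoundException above.
    return "file:" + os.path.join(cwd, data_path)

print(resolve_relative("data/mllib/sample_kmeans_data.txt"))
# → file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt
```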
I was told that the data file location string needs to be consistent between Lambda and Spark: your Lambda code expects the data file to be somewhere under /tmp/lambda. I looked at what was actually under /tmp/lambda and found a spark folder. So my work-around was to create a temporary /tmp/lambda/spark/data/mllib/ directory on my EC2 instance, copy the data file there, and point spark.read at that file. Specifically, I changed line 42 to:
import os
import shutil

data_folder = '/home/ec2-user/driver/data/mllib'
lambda_folder = '/tmp/lambda/spark/data/mllib'  # path the Lambda side expects
filename = 'sample_kmeans_data.txt'

# Stage the data file under /tmp/lambda so the path is valid on both sides.
os.makedirs(lambda_folder, exist_ok=True)
shutil.copy(os.path.join(data_folder, filename), lambda_folder)

dataset = spark.read.format("libsvm").load(os.path.join(lambda_folder, filename))
And then it worked fine.
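The same staging step could be factored into a small helper so other examples don't need to be patched by hand. This is only a sketch: the function name stage_for_lambda and the stage_root default are mine, not part of spark-on-lambda.

```python
import os
import shutil

def stage_for_lambda(local_path, stage_root='/tmp/lambda/spark'):
    # Copy a driver-local data file under the directory tree the Lambda
    # side expects, preserving the relative layout, and return the
    # staged path to pass to spark.read.
    rel = os.path.relpath(local_path, start=os.getcwd())
    staged = os.path.join(stage_root, rel)
    os.makedirs(os.path.dirname(staged), exist_ok=True)
    shutil.copy(local_path, staged)
    return staged

# Usage in kmeans_example.py would then look like:
#   dataset = spark.read.format("libsvm").load(
#       stage_for_lambda('data/mllib/sample_kmeans_data.txt'))
```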
I suppose that many of the Python example files have this problem, so it can be a barrier for Python users.