pip install downloads pyspark by default & fails to work
When I `pip install ceja`, it automatically downloads `pyspark-3.1.1.tar.gz` (212.3 MB), which is a problem because it's the wrong version (I'm using 3.0.0 on both EMR & WSL). Even when I uninstall it, I still get errors on EMR. Can this behavior be stopped?
```
[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 install ceja
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting ceja
Downloading https://files.pythonhosted.org/packages/c6/80/f372c62a83175f4c54229474f543aeca3344f4c64aab4bcfe7cf05f50cbf/ceja-0.2.0-py3-none-any.whl
Collecting pyspark>2.0.0 (from ceja)
Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
100% |████████████████████████████████| 212.3MB 6.3kB/s
Collecting jellyfish<0.9.0,>=0.8.2 (from ceja)
Downloading https://files.pythonhosted.org/packages/04/3f/d03cb056f407ef181a45569255348457b1a0915fc4eb23daeceb930a68a4/jellyfish-0.8.2.tar.gz (134kB)
100% |████████████████████████████████| 143kB 9.1MB/s
Collecting py4j==0.10.9 (from pyspark>2.0.0->ceja)
Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
100% |████████████████████████████████| 204kB 6.5MB/s
Installing collected packages: py4j, pyspark, jellyfish, ceja
Running setup.py install for pyspark ... done
Running setup.py install for jellyfish ... done
Successfully installed ceja-0.2.0 jellyfish-0.8.2 py4j-0.10.9 pyspark-3.1.1
[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 uninstall pyspark
Proceed (y/n)? y
..(snip)..
Successfully uninstalled pyspark-3.1.1
```
When I do the above and attempt to use it:
```
>>> df_m.columns
['guid_consumer_hashed_df10', 'guid_customer_hashed_df10', 'guidr_m', 'jws_fnm_m', 'jws_lnm_m', 'gender_m', 'state_m', 'zip3_m', 'soundex_fnm_m', 'lev_gender_m', 'lev_state_m', 'lev_zip3_m', 'lev_soundex_fnm_m']
```
The `jws_???_m` columns are created with:
```
... .withColumn(
...     "jws_fnm_m",
...     ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
... )
```
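(For reference, a minimal self-contained version of that step looks roughly like the sketch below; the toy DataFrame and the `spark` session setup here are illustrative stand-ins for the join that actually produces `df_m` in my job.)

```
import ceja
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the joined DataFrame used in my job.
df_m = spark.createDataFrame(
    [("jon", "john"), ("mary", "marie")],
    ["firstname_df10", "firstname_df4"],
)

df_m = df_m.withColumn(
    "jws_fnm_m",
    ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
)
df_m.show()  # this is the call that fails on the EMR executors
```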
I can see the columns, but `show()` fails:
```
>>> df_m.show()
21/03/26 06:01:50 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 40007, ip-172-31-80-99.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 589, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'jellyfish'
```
Attempting to install it fails:
```
$ sudo /usr/bin/pip3 install jellifish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting jellifish
Could not find a version that satisfies the requirement jellifish (from versions: )
No matching distribution found for jellifish
```
@yelled1 - I removed the hard dependency on PySpark, hopefully this will solve the issue.
The hard PySpark dependency caused an issue on another project as well.
I just published ceja v0.3.0; it should be on PyPI.
Can you try again and let me know if the new version solves your issue?
@MrPowers, pyspark did not get downloaded, which is great & thanks a bunch, but I got the jellyfish error below. Still, I do have it installed:
```
[hadoop@ip-172-31-83-44 ~]$ sudo /usr/bin/pip3 install jellyfish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Requirement already satisfied: jellyfish in /usr/local/lib/python3.7/site-packages
```
Also note that this is NOT an issue on WSL2, but it is on EMR (WSL2 was a reinstall & EMR was a fresh cluster). I'm using findspark.py on both before initiating Spark from vim, and `import jellyfish` works.
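A quick way to confirm it's a driver-vs-executor mismatch is to run the import inside a task so it executes on the workers (rough sketch, assuming an active SparkSession named `spark`):

```
import jellyfish  # succeeds on the driver / master node

def worker_import(_):
    # Executes on an executor; raises ModuleNotFoundError there if jellyfish
    # is only installed on the master node.
    import jellyfish
    return jellyfish.__file__

sc = spark.sparkContext
print(sc.parallelize(range(4), 4).map(worker_import).collect())
```

If the executors are missing the package, the `collect` should fail with the same ModuleNotFoundError as above.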
Just confirmed that the same error happens under spark-submit.
```
>>> df_m.show()
21/03/28 15:11:48 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 30005, ip-172-31-86-169.ec2.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'jellyfish'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/28 15:11:48 ERROR TaskSetManager: Task 0 in stage 7.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 441, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'jellyfish'
```
The below will work, but that means it only works for spark-submit, while the VSCode or vi REPL will NOT:
```
mkdir $HOME/lib
pip3 install ceja -t $HOME/lib/
cd $HOME/lib/
zip -r ~/include_py_modules.zip .
cd $HOME/
/usr/bin/nohup spark-submit --packages io.delta:delta-core_2.12:0.7.0 --py-files $HOME/include_py_modules.zip --driver-memory 8g --executor-memory 8g my_python_script.py > ~/output.log 2>&1 &
```
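An alternative I haven't fully verified (sketch, assuming the same ~/include_py_modules.zip built above and an active SparkSession) is to ship the zip from inside the REPL with `addPyFile`, which distributes it to the executors without going through spark-submit:

```
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distribute the locally built zip of ceja + jellyfish to every executor
# and put it on their PYTHONPATH (it is also added to the driver's sys.path).
spark.sparkContext.addPyFile(os.path.expanduser("~/include_py_modules.zip"))

import ceja  # should now resolve on the driver and, via the zip, on the workers
```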
Yeah, perhaps vendoring Jellyfish is the best path forward to avoid the transitive dependency issue. Python packaging is difficult, and even harder when Spark is added to the mix.
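For illustration only, a vendored import could look something like the fallback below (the `ceja._vendor` module path is made up here, not how ceja is actually laid out):

```
# Hypothetical sketch: prefer a copy of jellyfish bundled inside the ceja
# package so executors never need a separately installed jellyfish.
try:
    from ceja._vendor import jellyfish  # bundled copy shipped in the wheel (invented path)
except ImportError:
    import jellyfish  # fall back to whatever is on the worker's PYTHONPATH
```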