Cannot add additional JAR files (hadoop-azure-3.3.1.jar, azure-storage-8.6.6.jar) to custom container image
Name and Version
bitnami/spark:3
What steps will reproduce the bug?
- Create a custom Spark container image with additional JAR files. The attached Dockerfile (Dockerfile.spark.txt) includes hadoop-azure-3.3.1.jar, azure-storage-8.6.6.jar, and their dependencies.
- Run the resulting custom image with a command similar to:
  docker run --rm -it <custom_image_from_above_step>
- Open another terminal and connect to the running container with bash.
- Run the command below:
  pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6
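The steps above can be sketched as a single shell session (the image and container names here are illustrative, not from the original report):

```shell
# Build the custom image from the attached Dockerfile (names are illustrative)
docker build -t my-spark:azure -f Dockerfile.spark .

# Start the container in one terminal
docker run --rm -it --name spark-test my-spark:azure

# In a second terminal, open a shell inside it and launch pyspark
docker exec -it spark-test bash
pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6
```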
What is the expected behavior?
The packages should be loaded.
What do you see instead?
Python 3.8.13 (default, Apr 11 2022, 12:27:15)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/opt/bitnami/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /opt/bitnami/spark/.ivy2/cache
The jars for the packages stored in: /opt/bitnami/spark/.ivy2/jars
org.apache.hadoop#hadoop-azure added as a dependency
com.microsoft.azure#azure-storage added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0aaec6b9-5d40-4793-b19d-35af6d4d7168;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/bitnami/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-0aaec6b9-5d40-4793-b19d-35af6d4d7168-1.0.xml (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:71)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:63)
at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:553)
at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:183)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:259)
at org.apache.ivy.Ivy.resolve(Ivy.java:522)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1445)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/opt/bitnami/spark/python/pyspark/shell.py", line 35, in <module>
SparkContext._ensure_initialized() # type: ignore
File "/opt/bitnami/spark/python/pyspark/context.py", line 339, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/opt/bitnami/spark/python/pyspark/java_gateway.py", line 108, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
Additional information
FROM bitnami/spark:3
USER root
# Download hadoop-azure, azure-storage, and dependencies (See above)
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.1/hadoop-azure-3.3.1.jar --output /opt/bitnami/spark/jars/hadoop-azure-3.3.1.jar
RUN curl https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/8.6.6/azure-storage-8.6.6.jar --output /opt/bitnami/spark/jars/azure-storage-8.6.6.jar
RUN curl https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar --output /opt/bitnami/spark/jars/httpclient-4.5.13.jar
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar --output /opt/bitnami/spark/jars/hadoop-shaded-guava-1.1.1.jar
RUN curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util-ajax/9.4.40.v20210413/jetty-util-ajax-9.4.40.v20210413.jar --output /opt/bitnami/spark/jars/jetty-util-ajax-9.4.40.v20210413.jar
RUN curl https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar --output /opt/bitnami/spark/jars/jackson-mapper-asl-1.9.13.jar
RUN curl https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar --output /opt/bitnami/spark/jars/jackson-core-asl-1.9.13.jar
RUN curl https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar --output /opt/bitnami/spark/jars/wildfly-openssl-1.0.7.Final.jar
RUN curl https://repo1.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.13/httpcore-4.4.13.jar --output /opt/bitnami/spark/jars/httpcore-4.4.13.jar
RUN curl https://repo1.maven.org/maven2/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar --output /opt/bitnami/spark/jars/commons-logging-1.1.3.jar
RUN curl https://repo1.maven.org/maven2/commons-codec/commons-codec/1.11/commons-codec-1.11.jar --output /opt/bitnami/spark/jars/commons-codec-1.11.jar
RUN curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util/9.4.40.v20210413/jetty-util-9.4.40.v20210413.jar --output /opt/bitnami/spark/jars/jetty-util-9.4.40.v20210413.jar
RUN curl https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.9.4/jackson-core-2.9.4.jar --output /opt/bitnami/spark/jars/jackson-core-2.9.4.jar
RUN curl https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.12/slf4j-api-1.7.12.jar --output /opt/bitnami/spark/jars/slf4j-api-1.7.12.jar
RUN curl https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.4/commons-lang3-3.4.jar --output /opt/bitnami/spark/jars/commons-lang3-3.4.jar
RUN curl https://repo1.maven.org/maven2/com/microsoft/azure/azure-keyvault-core/1.2.4/azure-keyvault-core-1.2.4.jar --output /opt/bitnami/spark/jars/azure-keyvault-core-1.2.4.jar
RUN curl https://repo1.maven.org/maven2/com/google/guava/guava/24.1.1-jre/guava-24.1.1-jre.jar --output /opt/bitnami/spark/jars/guava-24.1.1-jre.jar
RUN curl https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar --output /opt/bitnami/spark/jars/jsr305-1.3.9.jar
RUN curl https://repo1.maven.org/maven2/org/checkerframework/checker-compat-qual/2.0.0/checker-compat-qual-2.0.0.jar --output /opt/bitnami/spark/jars/checker-compat-qual-2.0.0.jar
RUN curl https://repo1.maven.org/maven2/com/google/errorprone/error_prone_annotations/2.1.3/error_prone_annotations-2.1.3.jar --output /opt/bitnami/spark/jars/error_prone_annotations-2.1.3.jar
RUN curl https://repo1.maven.org/maven2/com/google/j2objc/j2objc-annotations/1.1/j2objc-annotations-1.1.jar --output /opt/bitnami/spark/jars/j2objc-annotations-1.1.jar
RUN curl https://repo1.maven.org/maven2/org/codehaus/mojo/animal-sniffer-annotations/1.14/animal-sniffer-annotations-1.14.jar --output /opt/bitnami/spark/jars/animal-sniffer-annotations-1.14.jar
ENV BITNAMI_APP_NAME=spark BITNAMI_IMAGE_VERSION=3.0.0-debian-10-r48 JAVA_HOME=/opt/bitnami/java LD_LIBRARY_PATH=/opt/bitnami/python/lib/:/opt/bitnami/spark/venv/lib/python3.6/site-packages/numpy.libs/: LIBNSS_WRAPPER_PATH=/opt/bitnami/common/lib/libnss_wrapper.so NSS_WRAPPER_GROUP=/opt/bitnami/spark/tmp/nss_group NSS_WRAPPER_PASSWD=/opt/bitnami/spark/tmp/nss_passwd SPARK_HOME=/opt/bitnami/spark
WORKDIR /opt/bitnami/spark
USER 1001
ENTRYPOINT ["/opt/bitnami/scripts/spark/entrypoint.sh"]
CMD ["/opt/bitnami/scripts/spark/run.sh"]
Is this because of this issue?
Hello @porscheme,
As you said, your error could be related to the permissions configuration this container uses. I tested downloading the JARs into a local directory which I mounted in the container, and the command you provided ran successfully. You will need to create the local directory where the JARs will be downloaded, as explained in https://github.com/bitnami/bitnami-docker-spark/issues/7#issuecomment-650968888
The files and the docker-compose configuration I used were the following:
$ ls -la
total 8
drwxr-xr-x 5 fdepaz 160 Apr 18 17:18 .
drwxr-xr-x 50 fdepaz 1600 Apr 18 16:24 ..
-rw-r--r-- 1 fdepaz 918 Apr 18 17:12 docker-compose.yml
drwxrwxrwx 27 fdepaz 864 Apr 18 17:14 jars_dir
-rw-r--r-- 1 fdepaz 43 Apr 18 16:40 spark-defaults.conf
$ ls jars_dir
animal-sniffer-annotations-1.14.jar checker-compat-qual-2.0.0.jar guava-24.1.1-jre.jar j2objc-annotations-1.1.jar jetty-util-9.4.40.v20210413.jar
aws-java-sdk-bundle-1.11.704.jar commons-codec-1.11.jar hadoop-azure-3.3.1.jar jackson-core-2.9.4.jar jetty-util-ajax-9.4.40.v20210413.jar
azure-keyvault-core-1.2.4.jar commons-lang3-3.4.jar hadoop-shaded-guava-1.1.1.jar jackson-core-asl-1.9.13.jar jsr305-1.3.9.jar
azure-storage-8.6.6.jar commons-logging-1.1.3.jar httpclient-4.5.13.jar jackson-mapper-asl-1.9.13.jar slf4j-api-1.7.12.jar
cache error_prone_annotations-2.1.3.jar httpcore-4.4.13.jar jars wildfly-openssl-1.0.7.Final.jar
$ cat spark-defaults.conf
spark.jars.ivy /opt/bitnami/spark/jars/ivy
$ cat docker-compose.yml
version: '2'
services:
  spark:
    image: bitnami/spark
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    volumes:
      - ./spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
      - ./jars_dir:/opt/bitnami/spark/jars/ivy:z
  spark-worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
      - ./jars_dir:/opt/bitnami/spark/jars/ivy:z
You are getting these issues because user 1001 doesn't have write permissions in /opt/bitnami/spark, and this would be the recommended way to solve it. An alternative would be to run the container as root, though that is not advisable.
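The same workaround can be set up without docker-compose; a minimal sketch (the paths and the 777 mode are illustrative — the key point is that the mounted directory must be writable by UID 1001 inside the container):

```shell
# Create a local directory that UID 1001 can write Ivy's cache and jars into
mkdir -p jars_dir
chmod 777 jars_dir

# Point spark.jars.ivy at the path where the directory will be mounted
cat > spark-defaults.conf <<'EOF'
spark.jars.ivy /opt/bitnami/spark/jars/ivy
EOF

# Mount both into the container and run the original command
docker run --rm -it \
  -v "$PWD/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf" \
  -v "$PWD/jars_dir:/opt/bitnami/spark/jars/ivy" \
  bitnami/spark:3 \
  pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6
```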
FWIW, rather than all those curl commands, you could use mvn dependency:get, which is effectively what --packages should be doing.
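For example (a sketch that assumes Maven is installed; dependency:get resolves the artifact together with its transitive dependencies into the local repository, ~/.m2 by default):

```shell
# Resolve each top-level artifact plus its transitive dependencies
mvn dependency:get -Dartifact=org.apache.hadoop:hadoop-azure:3.3.1
mvn dependency:get -Dartifact=com.microsoft.azure:azure-storage:8.6.6

# The resolved jars can then be copied out of the local repository, e.g.:
find ~/.m2/repository -name '*.jar' -exec cp {} /opt/bitnami/spark/jars/ \;
```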
Thanks @FraPazGal , I would love to test it out. How do we do this using Helm Charts?
Create a persistent volume with the JARs, or add them to a custom image like you're already doing.
Hello @porscheme,
As explained by @OneCricketeer, you would need to use an initContainer to download the .jar files and a persistent volume to store them. After downloading them, you can mount that volume in the pod's main container (spark).
If you have any issues applying this solution in our chart, please do not hesitate to open a ticket in our bitnami/charts repo so we can further help you.
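A sketch of what that could look like as a values file for the bitnami/spark chart. The value keys (initContainers, extraVolumes, extraVolumeMounts), the curlimages/curl image, and the emptyDir volume are assumptions used to illustrate the idea — check the chart's values.yaml for the exact supported fields, and swap in a PVC for real persistence:

```shell
# Hypothetical values: an initContainer downloads a jar into a shared
# volume that is then mounted into the Spark containers.
cat > values.yaml <<'EOF'
initContainers:
  - name: download-jars
    image: curlimages/curl
    command:
      - sh
      - -c
      - >
        curl -fL -o /jars/hadoop-azure-3.3.1.jar
        https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.1/hadoop-azure-3.3.1.jar
    volumeMounts:
      - name: extra-jars
        mountPath: /jars
extraVolumes:
  - name: extra-jars
    emptyDir: {}
extraVolumeMounts:
  - name: extra-jars
    mountPath: /opt/bitnami/spark/ivy
EOF
helm install my-spark bitnami/spark -f values.yaml
```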
Thanks @FraPazGal and @OneCricketeer for the reply.
- Does it mean your official documentation for additional JAR files doesn't work, or did I misunderstand the docs? Below is an extract from the docs:
Installing additional jars
By default, this container bundles a generic set of jar files but the default image can be extended to add as many jars as needed for your specific use case. For instance, the following Dockerfile adds [aws-java-sdk-bundle-1.11.704.jar](https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.704):
FROM bitnami/spark
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.704/aws-java-sdk-bundle-1.11.704.jar --o
- Yes, I can try the volume approach.
Hi @porscheme,
Indeed, the current approach explained in the documentation is not working. I have opened an internal task to investigate whether the documentation is obsolete or the approach should still be valid and we need to make some changes to the container. I'll leave this ticket on hold so we can share any news regarding this in the future.
We are going to transfer this issue to bitnami/containers
In order to unify the approaches followed in Bitnami containers and Bitnami charts, we are moving some issues in bitnami/bitnami-docker-<container> repositories to bitnami/containers.
Please follow bitnami/containers to stay updated about the latest Bitnami images.
More information here: https://blog.bitnami.com/2022/07/new-source-of-truth-bitnami-containers.html
We add the JARs as the image's native user and everything works fine.
USER 1001
# Azure SDKs, Hadoop-Azure, Delta, Wildfly SSL, common jars (delta + azure hadoop)
RUN wget --quiet https://repo1.maven.org/maven2/com/microsoft/azure/adal4j/1.6.7/adal4j-1.6.7.jar -O /opt/bitnami/spark/jars/adal4j-1.6.7.jar \
&& wget --quiet https://repo1.maven.org/maven2/com/microsoft/azure/azure-annotations/1.10.0/azure-annotations-1.10.0.jar -O /opt/bitnami/spark/jars/azure-annotations-1.10.0.jar \
...
Spark doesn't run as root in the Bitnami image unless you override it, so it can't access the JARs you download under root. And in general, don't run containers as root :)
@porscheme Did you try what @george-zubrienko suggests?
Hi @porscheme,
We have updated and checked the documentation. As @george-zubrienko says, user 1001 is the correct user to download the JAR files. With that change, you can see the modules in use:
. . .
:: resolution report :: resolve 31609ms :: artifacts dl 1917ms
:: modules in use:
com.fasterxml.jackson.core#jackson-core;2.9.4 from central in [default]
com.google.code.findbugs#jsr305;1.3.9 from central in [default]
com.google.errorprone#error_prone_annotations;2.1.3 from central in [default]
com.google.guava#guava;24.1.1-jre from central in [default]
com.google.j2objc#j2objc-annotations;1.1 from central in [default]
com.microsoft.azure#azure-keyvault-core;1.2.4 from central in [default]
com.microsoft.azure#azure-storage;8.6.6 from central in [default] # <<-----------------
commons-codec#commons-codec;1.11 from central in [default]
commons-logging#commons-logging;1.1.3 from central in [default]
org.apache.commons#commons-lang3;3.4 from central in [default]
org.apache.hadoop#hadoop-azure;3.3.1 from central in [default] # <<-----------------
org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1 from central in [default]
. . .
If you have any problems after this change, feel free to give your feedback. Thanks for your comments.