
Cannot add additional JAR files (hadoop-azure-3.3.1.jar, azure-storage-8.6.6.jar) to custom container image

Open porscheme opened this issue 2 years ago • 10 comments

Name and Version

bitnami/spark:3

What steps will reproduce the bug?

  1. Create a custom Spark container image with additional JAR files.

  2. The attached Dockerfile (Dockerfile.spark.txt) adds hadoop-azure-3.3.1.jar, azure-storage-8.6.6.jar, and their dependencies.

  3. Run the produced custom Docker image using a command similar to the one below:

docker run --rm -it <custom_image_from_above_step>

[Screenshot: 2022-04-14 162133.png]

  4. Open another terminal and connect to the running Docker container from step 3 using bash.
  5. Run the command below:
pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6

What is the expected behavior?

The packages should load successfully.

What do you see instead?

:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6
Python 3.8.13 (default, Apr 11 2022, 12:27:15)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/opt/bitnami/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /opt/bitnami/spark/.ivy2/cache
The jars for the packages stored in: /opt/bitnami/spark/.ivy2/jars
org.apache.hadoop#hadoop-azure added as a dependency
com.microsoft.azure#azure-storage added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0aaec6b9-5d40-4793-b19d-35af6d4d7168;1.0
        confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/bitnami/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-0aaec6b9-5d40-4793-b19d-35af6d4d7168-1.0.xml (No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:71)
        at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:63)
        at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:553)
        at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:183)
        at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:259)
        at org.apache.ivy.Ivy.resolve(Ivy.java:522)
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1445)
        at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/opt/bitnami/spark/python/pyspark/shell.py", line 35, in <module>
    SparkContext._ensure_initialized()  # type: ignore
  File "/opt/bitnami/spark/python/pyspark/context.py", line 339, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/bitnami/spark/python/pyspark/java_gateway.py", line 108, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number

Additional information


FROM bitnami/spark:3

USER root

# Download hadoop-azure, azure-storage, and dependencies (See above)
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.1/hadoop-azure-3.3.1.jar --output /opt/bitnami/spark/jars/hadoop-azure-3.3.1.jar
RUN curl https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/8.6.6/azure-storage-8.6.6.jar --output /opt/bitnami/spark/jars/azure-storage-8.6.6.jar
RUN curl https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar --output /opt/bitnami/spark/jars/httpclient-4.5.13.jar
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar --output /opt/bitnami/spark/jars/hadoop-shaded-guava-1.1.1.jar
RUN curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util-ajax/9.4.40.v20210413/jetty-util-ajax-9.4.40.v20210413.jar --output /opt/bitnami/spark/jars/jetty-util-ajax-9.4.40.v20210413.jar
RUN curl https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar --output /opt/bitnami/spark/jars/jackson-mapper-asl-1.9.13.jar
RUN curl https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar --output /opt/bitnami/spark/jars/jackson-core-asl-1.9.13.jar
RUN curl https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar --output /opt/bitnami/spark/jars/wildfly-openssl-1.0.7.Final.jar
RUN curl https://repo1.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.13/httpcore-4.4.13.jar --output /opt/bitnami/spark/jars/httpcore-4.4.13.jar
RUN curl https://repo1.maven.org/maven2/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar --output /opt/bitnami/spark/jars/commons-logging-1.1.3.jar
RUN curl https://repo1.maven.org/maven2/commons-codec/commons-codec/1.11/commons-codec-1.11.jar --output /opt/bitnami/spark/jars/commons-codec-1.11.jar
RUN curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util/9.4.40.v20210413/jetty-util-9.4.40.v20210413.jar --output /opt/bitnami/spark/jars/jetty-util-9.4.40.v20210413.jar
RUN curl https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.9.4/jackson-core-2.9.4.jar --output /opt/bitnami/spark/jars/jackson-core-2.9.4.jar
RUN curl https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.12/slf4j-api-1.7.12.jar --output /opt/bitnami/spark/jars/slf4j-api-1.7.12.jar
RUN curl https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.4/commons-lang3-3.4.jar --output /opt/bitnami/spark/jars/commons-lang3-3.4.jar
RUN curl https://repo1.maven.org/maven2/com/microsoft/azure/azure-keyvault-core/1.2.4/azure-keyvault-core-1.2.4.jar --output /opt/bitnami/spark/jars/azure-keyvault-core-1.2.4.jar
RUN curl https://repo1.maven.org/maven2/com/google/guava/guava/24.1.1-jre/guava-24.1.1-jre.jar --output /opt/bitnami/spark/jars/guava-24.1.1-jre.jar
RUN curl https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar --output /opt/bitnami/spark/jars/jsr305-1.3.9.jar
RUN curl https://repo1.maven.org/maven2/org/checkerframework/checker-compat-qual/2.0.0/checker-compat-qual-2.0.0.jar --output /opt/bitnami/spark/jars/checker-compat-qual-2.0.0.jar
RUN curl https://repo1.maven.org/maven2/com/google/errorprone/error_prone_annotations/2.1.3/error_prone_annotations-2.1.3.jar --output /opt/bitnami/spark/jars/error_prone_annotations-2.1.3.jar
RUN curl https://repo1.maven.org/maven2/com/google/j2objc/j2objc-annotations/1.1/j2objc-annotations-1.1.jar --output /opt/bitnami/spark/jars/j2objc-annotations-1.1.jar
RUN curl https://repo1.maven.org/maven2/org/codehaus/mojo/animal-sniffer-annotations/1.14/animal-sniffer-annotations-1.14.jar --output /opt/bitnami/spark/jars/animal-sniffer-annotations-1.14.jar

ENV BITNAMI_APP_NAME=spark BITNAMI_IMAGE_VERSION=3.0.0-debian-10-r48 JAVA_HOME=/opt/bitnami/java LD_LIBRARY_PATH=/opt/bitnami/python/lib/:/opt/bitnami/spark/venv/lib/python3.6/site-packages/numpy.libs/: LIBNSS_WRAPPER_PATH=/opt/bitnami/common/lib/libnss_wrapper.so NSS_WRAPPER_GROUP=/opt/bitnami/spark/tmp/nss_group NSS_WRAPPER_PASSWD=/opt/bitnami/spark/tmp/nss_passwd SPARK_HOME=/opt/bitnami/spark

WORKDIR /opt/bitnami/spark

USER 1001

ENTRYPOINT ["/opt/bitnami/scripts/spark/entrypoint.sh"]

CMD ["/opt/bitnami/scripts/spark/run.sh"]

porscheme avatar Apr 14 '22 23:04 porscheme

Is this because of this issue?

porscheme avatar Apr 15 '22 04:04 porscheme

Hello @porscheme,

As you said, your error could be related to the permissions configuration this container uses. I tested downloading the jars into a local directory which I mounted in the container, and the command you provided ran successfully. You will need to create the local directory where the jars will be downloaded, as explained in https://github.com/bitnami/bitnami-docker-spark/issues/7#issuecomment-650968888

The files and the docker-compose configuration I used were the following:

$ ls -la
total 8
drwxr-xr-x  5 fdepaz  160 Apr 18 17:18 .
drwxr-xr-x 50 fdepaz 1600 Apr 18 16:24 ..
-rw-r--r--  1 fdepaz  918 Apr 18 17:12 docker-compose.yml
drwxrwxrwx 27 fdepaz  864 Apr 18 17:14 jars_dir
-rw-r--r--  1 fdepaz   43 Apr 18 16:40 spark-defaults.conf

$ ls jars_dir
animal-sniffer-annotations-1.14.jar  checker-compat-qual-2.0.0.jar	guava-24.1.1-jre.jar	       j2objc-annotations-1.1.jar     jetty-util-9.4.40.v20210413.jar
aws-java-sdk-bundle-1.11.704.jar     commons-codec-1.11.jar		hadoop-azure-3.3.1.jar	       jackson-core-2.9.4.jar	      jetty-util-ajax-9.4.40.v20210413.jar
azure-keyvault-core-1.2.4.jar	     commons-lang3-3.4.jar		hadoop-shaded-guava-1.1.1.jar  jackson-core-asl-1.9.13.jar    jsr305-1.3.9.jar
azure-storage-8.6.6.jar		     commons-logging-1.1.3.jar		httpclient-4.5.13.jar	       jackson-mapper-asl-1.9.13.jar  slf4j-api-1.7.12.jar
cache				     error_prone_annotations-2.1.3.jar	httpcore-4.4.13.jar	       jars			      wildfly-openssl-1.0.7.Final.jar

$ cat spark-defaults.conf
spark.jars.ivy /opt/bitnami/spark/jars/ivy

$ cat docker-compose.yml
version: '2'

services:
  spark:
    image: bitnami/spark
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    volumes:
      - ./spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
      - ./jars_dir:/opt/bitnami/spark/jars/ivy:z
  spark-worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
      - ./jars_dir:/opt/bitnami/spark/jars/ivy:z

You are getting issues because user 1001 doesn't have write permissions in /opt/bitnami/spark, and this would be the recommended way to solve it. An alternative would be to run the container as root, though it is not advisable.
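
For a quick test without mounting anything, it should also be possible to point Ivy at a writable location when launching the shell inside the container (a sketch, not taken from the thread; the /tmp path is an assumption, any directory writable by user 1001 works):

$ pyspark --conf spark.jars.ivy=/tmp/.ivy2 \
    --packages org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6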

FraPazGal avatar Apr 18 '22 15:04 FraPazGal

FWIW, rather than all those curl commands, you could use mvn dependency:get, which is effectively what --packages should be doing.
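
A minimal multi-stage sketch of that idea (untested; the Maven base image tag and the flattening step are assumptions, not part of this thread):

FROM maven:3.8-eclipse-temurin-11 AS fetcher
# dependency:get resolves each artifact plus its transitive dependencies into the local Maven repository
RUN mvn -q dependency:get -Dartifact=org.apache.hadoop:hadoop-azure:3.3.1 \
 && mvn -q dependency:get -Dartifact=com.microsoft.azure:azure-storage:8.6.6
# flatten the repository layout into a single directory of jars
RUN mkdir /jars && find /root/.m2/repository -name '*.jar' -exec cp {} /jars/ \;

FROM bitnami/spark:3
COPY --from=fetcher --chown=1001 /jars/ /opt/bitnami/spark/jars/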

OneCricketeer avatar Apr 19 '22 22:04 OneCricketeer


Thanks @FraPazGal, I would love to test it out. How do we do this using Helm Charts?

porscheme avatar Apr 19 '22 22:04 porscheme

using Helm Charts

Create a persistent volume with JARs or add them to a custom image like you're already doing

OneCricketeer avatar Apr 19 '22 22:04 OneCricketeer

Hello @porscheme,

As explained by @OneCricketeer, you would need to use an initContainer to download the .jar files and a persistent volume to store them. After downloading, you can mount that volume in the pod's main container (spark).

If you have any issues applying the solution in our chart, please do not hesitate to open a ticket in our bitnami/charts repo so we can further help you.
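
A hypothetical values.yaml sketch of that approach (the key names here are assumptions and vary between chart versions; check the chart's values reference before using):

# hypothetical sketch: verify key names against the chart's values.yaml
initContainers:
  - name: download-jars
    image: bitnami/bitnami-shell   # assumption: any small image with curl works
    command:
      - bash
      - -c
      - curl -sSL https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.1/hadoop-azure-3.3.1.jar -o /jars/hadoop-azure-3.3.1.jar
    volumeMounts:
      - name: extra-jars
        mountPath: /jars
extraVolumes:
  - name: extra-jars
    persistentVolumeClaim:
      claimName: spark-extra-jars   # hypothetical pre-created PVC
extraVolumeMounts:
  - name: extra-jars
    mountPath: /opt/bitnami/spark/jars/ivy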

FraPazGal avatar Apr 21 '22 14:04 FraPazGal


Thanks @FraPazGal and @OneCricketeer for the reply.

  1. Does it mean your official documentation for additional JAR files doesn't work, or did I misunderstand the docs? Below is an extract from the docs:
Installing additional jars
By default, this container bundles a generic set of jar files but the default image can be extended to add as many jars as needed for your specific use case. For instance, the following Dockerfile adds [aws-java-sdk-bundle-1.11.704.jar](https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.704):

FROM bitnami/spark
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.704/aws-java-sdk-bundle-1.11.704.jar --o
  2. Yes, I can try the volume approach.

porscheme avatar Apr 21 '22 15:04 porscheme

Hi @porscheme,

Indeed, the current approach explained in the documentation is not working. I have opened an internal task to investigate whether the documentation is obsolete, or whether the approach should still be valid and we need to make some changes to the container. I'll leave this ticket on hold so we can share any news regarding this in the future.

FraPazGal avatar Apr 25 '22 09:04 FraPazGal

We are going to transfer this issue to bitnami/containers

In order to unify the approaches followed in Bitnami containers and Bitnami charts, we are moving some issues in bitnami/bitnami-docker-<container> repositories to bitnami/containers.

Please follow bitnami/containers to stay updated about the latest Bitnami images.

More information here: https://blog.bitnami.com/2022/07/new-source-of-truth-bitnami-containers.html

carrodher avatar Jul 28 '22 13:07 carrodher

We add jars as the native user (1001) and everything works fine.

USER 1001

# Azure SDKs, Hadoop-Azure, Delta, Wildfly SSL, common jars (delta + azure hadoop)
RUN wget --quiet https://repo1.maven.org/maven2/com/microsoft/azure/adal4j/1.6.7/adal4j-1.6.7.jar -O /opt/bitnami/spark/jars/adal4j-1.6.7.jar \
&& wget --quiet https://repo1.maven.org/maven2/com/microsoft/azure/azure-annotations/1.10.0/azure-annotations-1.10.0.jar -O /opt/bitnami/spark/jars/azure-annotations-1.10.0.jar \
...

Spark doesn't run as root in the Bitnami image unless you override it, so it can't access the jars you download as root. And in general, don't run containers as root :)
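
A quick way to sanity-check ownership in any built image (a sketch; the image name is a placeholder):

$ docker run --rm <custom_image> ls -ln /opt/bitnami/spark/jars | head
# the numeric owner of the downloaded jars should be 1001, not 0 (root)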

george-zubrienko avatar Aug 26 '22 08:08 george-zubrienko

@porscheme Did you try what @george-zubrienko suggests?

recena avatar Oct 14 '22 16:10 recena

Hi @porscheme,

We have changed and checked the documentation. As @george-zubrienko says, user 1001 is the correct user to download the JAR files with. After that change, you can see the modules in use:

        . . .
:: resolution report :: resolve 31609ms :: artifacts dl 1917ms
        :: modules in use:
        com.fasterxml.jackson.core#jackson-core;2.9.4 from central in [default]
        com.google.code.findbugs#jsr305;1.3.9 from central in [default]
        com.google.errorprone#error_prone_annotations;2.1.3 from central in [default]
        com.google.guava#guava;24.1.1-jre from central in [default]
        com.google.j2objc#j2objc-annotations;1.1 from central in [default]
        com.microsoft.azure#azure-keyvault-core;1.2.4 from central in [default]
        com.microsoft.azure#azure-storage;8.6.6 from central in [default] # <<-----------------
        commons-codec#commons-codec;1.11 from central in [default]
        commons-logging#commons-logging;1.1.3 from central in [default]
        org.apache.commons#commons-lang3;3.4 from central in [default]
        org.apache.hadoop#hadoop-azure;3.3.1 from central in [default] # <<-----------------
        org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1 from central in [default]
        . . .

If you have any problems after this change, feel free to give your feedback. Thanks for your comments.

CeliaGMqrz avatar Oct 17 '22 14:10 CeliaGMqrz