
[bitnami/spark] Problem with Python code creating a Parquet file

Open kayvansol opened this issue 10 months ago • 6 comments

Name and Version

bitnami/spark

What architecture are you using?

amd64

What steps will reproduce the bug?

I tested the Python code below with Docker Compose and the Bitnami image, and it consistently failed to create the *.parquet file:

CSV read success:

(screenshot: readsuccess)

Parquet file creation failure:

(screenshot: parquetErr)

docker-compose.yml :

version: '3.6'

services:

  spark:
    container_name: spark
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark   
    ports:
      - 127.0.0.1:8081:8080
    

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark

docker run :

docker-compose up --scale spark-worker=2
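
A plausible cause worth ruling out (an assumption on my part, not confirmed in the thread): the compose file above shares no volume between the master and the workers, so each executor writes its part files inside its own worker container, while the directory visible to the driver ends up with only the `_SUCCESS` marker. A minimal compose sketch of one fix, where the `./spark-data` host path and `/opt/spark-data` mount point are illustrative names:

```yaml
# Sketch: mount one host directory into every Spark container so that
# executor output and driver output land in the same place.
services:
  spark:
    volumes:
      - ./spark-data:/opt/spark-data
  spark-worker:
    volumes:
      - ./spark-data:/opt/spark-data
```

With such a mount in place, writing to an absolute path under the shared mount (e.g. `/opt/spark-data/a.parquet`) should make the part files visible on the host.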

ctp.py :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingParquet").getOrCreate()

df = spark.read.option("header", True).csv("csv/file.csv")

df.show()

df.write.mode('overwrite').parquet("a.parquet")

spark submit :

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://35368355157f:7077 csv/ctp.py

Please help me 👍

What is the expected behavior?

No response

What do you see instead?

creation of a.parquet folder without *.parquet file

Additional information

No response

kayvansol avatar Apr 02 '24 22:04 kayvansol

Does the code show any kind of error that suggests there's an issue in the Bitnami packaging of Spark? It is not clear to me if the issue is in the Bitnami packaging or the use of Spark itself.

javsalgar avatar Apr 03 '24 08:04 javsalgar

Thanks for your response. There was no error at runtime. I tested my code with the apache/spark-py image (https://hub.docker.com/r/apache/spark-py) and the result was correct: the Parquet file was created. With the bitnami/spark image we can read the CSV data, but the Parquet file is not created (see the screenshots attached above).

kayvansol avatar Apr 03 '24 10:04 kayvansol

I also tested Python code that saves the DataFrame to JSON format, and hit the same problem as before:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingJson").getOrCreate()

df2 = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)], 
                            ["id", "name", "age"])


df2.show()

df2.write.mode('overwrite').json('file_name.json')

(screenshot: jsonErr, the JSON write failure)

Please say something helpful.

kayvansol avatar Apr 03 '24 22:04 kayvansol

With the Scala shell (spark-shell), everything is OK:

val df = spark.read.csv("csv/file.csv")

df.write.mode("overwrite").format("json").save("file_name.json")

(screenshots: jsonScala, jsonScalaFile, showing the JSON output created successfully)

But when the PySpark code is submitted with spark-submit, the output file is not found!

kayvansol avatar Apr 04 '24 17:04 kayvansol

I also tested Java code that saves the DataFrame to JSON format, with the same problem as before:

(screenshots: Javacsvread, JavacsvreadSchema, JavacsvreadNoFile, showing that the CSV read and printed schema succeed but no output file is created)

package arka;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ctjson {

	public static void main(String[] args) {

		SparkSession SPARK_SESSION = SparkSession.builder().appName("Mahla ctjson")
				.master("spark://6fe9e36ddaa9:7077")
				.getOrCreate();

		Dataset<Row> df = SPARK_SESSION.read().option("inferSchema", "true")
				.option("header", "true")
				.csv("csv/file.csv");

		df.show();

		df.printSchema();
		
		df.write().mode("overwrite").format("json").save("file_name.json");
		
	}
}

pom.xml :

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.mahla</groupId>
	<artifactId>arka</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>csvtojson</name>

	<dependencies>

		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.12</artifactId>
			<version>3.5.1</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-sql_2.12</artifactId>
			<version>3.5.1</version>
			<scope>provided</scope>
		</dependency>
		
	</dependencies>

</project>

jar file : ctj.zip

submit command :

./bin/spark-submit --class arka.ctjson --master spark://6fe9e36ddaa9:7077 csv/ctj.jar

Could you please check the issue?

kayvansol avatar Apr 05 '24 14:04 kayvansol

Hi @kayvansol,

Could you please provide the specific commands you are executing to replicate the issue, along with the corresponding logs from container initialization to the point where the problem occurs?

Thank you!

fevisera avatar Apr 11 '24 07:04 fevisera

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] avatar May 08 '24 01:05 github-actions[bot]

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

github-actions[bot] avatar May 13 '24 01:05 github-actions[bot]