[bitnami/spark] Problem with Python code creating a Parquet file
Name and Version
bitnami/spark
What architecture are you using?
amd64
What steps will reproduce the bug?
I tested the Python code below with Docker Compose and the Bitnami image, and the result was the same failure when creating the *.parquet file:
CSV read succeeds:
Parquet file creation fails:
docker-compose.yml :
version: '3.6'
services:
  spark:
    container_name: spark
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - 127.0.0.1:8081:8080
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
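One thing worth noting about the compose file above: in a standalone cluster, the executors on the worker containers write output part files to their own local filesystems, so the master/driver container may only ever see an empty output directory. A hedged sketch of a workaround, mounting one shared host directory at the same path in every service so all containers see the same files (the service names match the compose file; the host directory `./shared` and the mount path are assumptions to adapt):

```yaml
# Sketch only: share one host directory across master and workers so that
# output paths resolve to the same files in every container.
services:
  spark:
    volumes:
      - ./shared:/opt/bitnami/spark/csv
  spark-worker:
    volumes:
      - ./shared:/opt/bitnami/spark/csv
```

With a shared mount like this, part files written by any executor become visible from the master container as well.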
docker run :
docker-compose up --scale spark-worker=2
ctp.py :
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingParquet").getOrCreate()

# Read the CSV file, treating the first row as a header.
df = spark.read.option("header", True).csv("csv/file.csv")
df.show()

# Write the DataFrame as Parquet, overwriting any previous output.
df.write.mode('overwrite').parquet("a.parquet")
spark submit :
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://35368355157f:7077 csv/ctp.py
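A side note on the command above: `--class` only applies to Java/Scala jars and is ignored for a Python application, which is submitted directly. A minimal submit command for the script might therefore look like this (keeping the master URL from the original report):

```shell
# --class is only meaningful for JVM applications; a .py file needs no class name.
./bin/spark-submit --master spark://35368355157f:7077 csv/ctp.py
```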
please help me 👍
What is the expected behavior?
No response
What do you see instead?
The a.parquet folder is created, but it contains no *.parquet part files.
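The symptom can be checked programmatically. Below is a small standard-library sketch (independent of Spark) that inspects a Spark-style output directory and reports whether any part files were actually written; the directory name follows the report, and the `_SUCCESS` marker is what Spark leaves on the driver even when the part files ended up on another machine:

```python
import os
import tempfile

def list_part_files(output_dir):
    """Return the part files (the actual data) inside a Spark output directory."""
    return sorted(
        name for name in os.listdir(output_dir)
        if name.startswith("part-")
    )

# Simulate what the reporter observed: the output directory and the _SUCCESS
# marker exist on the driver, but the part files were written elsewhere
# (on the worker containers' local filesystems).
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "a.parquet")
    os.mkdir(out)
    open(os.path.join(out, "_SUCCESS"), "w").close()

    parts = list_part_files(out)
    print("part files found:", parts)  # an empty list reproduces the symptom
```

Running such a check inside each worker container (rather than only on the master) would show whether the data was written at all, just in a different place.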
Additional information
No response
Does the code show any kind of error that suggests there's an issue in the Bitnami packaging of Spark? It is not clear to me if the issue is in the Bitnami packaging or the use of Spark itself.
Thanks for your response. There was no error at runtime. I tested my code with the "apache/spark-py" image (https://hub.docker.com/r/apache/spark-py) and the result was correct: the Parquet file was created. But with the bitnami/spark image we can read the CSV data, yet the Parquet file is not created (based on the pictures attached above).
I tested the Python code for saving the DataFrame in JSON format, but the result was the same problem I mentioned before:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingJson").getOrCreate()
df2 = spark.createDataFrame([(1, "Alice", 10),
                             (2, "Bob", 20),
                             (3, "Charlie", 30)],
                            ["id", "name", "age"])
df2.show()
df2.write.mode('overwrite').json('file_name.json')
Please say something helpful.
With the Scala shell (spark-shell), everything is OK:
val df = spark.read.csv("csv/file.csv")
df.write.mode("overwrite").format("json").save("file_name.json")
But with pyspark and spark-submit of the Python code, the output file is not found!
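One plausible explanation (an assumption, not confirmed by the logs in this thread) is that a relative output path like `file_name.json` is resolved against each writing process's own working directory and filesystem, so the driver and the executors can end up writing to different places. A plain-Python sketch of that path behaviour:

```python
import os
import tempfile

def resolve(relative_path, workdir):
    """Resolve a relative path the way a process with cwd=workdir would."""
    return os.path.normpath(os.path.join(workdir, relative_path))

# Two hypothetical working directories, standing in for the driver container
# and a worker container.
driver_cwd = tempfile.mkdtemp(prefix="driver-")
worker_cwd = tempfile.mkdtemp(prefix="worker-")

# The same relative path resolves to two different absolute locations.
print(resolve("file_name.json", driver_cwd))
print(resolve("file_name.json", worker_cwd))
```

Using an absolute path on a location that all containers share (e.g. a mounted volume) sidesteps this ambiguity.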
I tested the Java code for saving the DataFrame in JSON format, but the result was the same problem I mentioned before:
package arka;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ctjson {
    public static void main(String[] args) {
        SparkSession SPARK_SESSION = SparkSession.builder().appName("Mahla ctjson")
                .master("spark://6fe9e36ddaa9:7077")
                .getOrCreate();

        Dataset<Row> df = SPARK_SESSION.read().option("inferSchema", "true")
                .option("header", "true")
                .csv("csv/file.csv");
        df.show();
        df.printSchema();

        df.write().mode("overwrite").format("json").save("file_name.json");
    }
}
pom.xml :
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mahla</groupId>
  <artifactId>arka</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>csvtojson</name>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.5.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.5.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
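One detail worth checking in the pom.xml above: spark-sql is marked `provided` but spark-core is not, so the packaged jar may bundle a Spark core that conflicts with the cluster's own version. A hedged fragment with both dependencies marked `provided` (versions unchanged from the original), since the Bitnami containers already ship Spark at runtime:

```xml
<!-- Sketch: both Spark artifacts are supplied by the cluster at runtime. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.5.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.5.1</version>
  <scope>provided</scope>
</dependency>
```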
jar file : ctj.zip
submit command :
./bin/spark-submit --class arka.ctjson --master spark://6fe9e36ddaa9:7077 csv/ctj.jar
Could you please check the issue?
Hi @kayvansol,
Could you please provide the specific commands you are executing to replicate the issue, along with the corresponding logs from container initialization to the point where the problem occurs?
Thank you!
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.