oci-hdfs-connector
Unclear WARN log when writing parquet with Spark in Scala
Hi, team.
I'm using the oci-hdfs-connector and Spark with sbt, configured as below:
name := "ScalaSparkProject"
version := "0.1"
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "com.oracle.oci.sdk" % "oci-hdfs-connector" % "3.3.4.1.0.0",
  "org.apache.spark" %% "spark-core" % "3.4.0",
  "org.apache.spark" %% "spark-sql" % "3.4.0"
)

javaOptions ++= Seq(
  "-XX:+IgnoreUnrecognizedVMOptions",
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED",
  "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED",
  "--add-opens=java.base/java.io=ALL-UNNAMED",
  "--add-opens=java.base/java.net=ALL-UNNAMED",
  "--add-opens=java.base/java.nio=ALL-UNNAMED",
  "--add-opens=java.base/java.util=ALL-UNNAMED",
  "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED",
  "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED",
  "--add-opens=java.base/sun.security.action=ALL-UNNAMED",
  "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED",
  "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
)

ThisBuild / fork := true
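For context, sbt resolves the OCI Java SDK transitively through the connector, so the resolved versions can be inspected with sbt's dependency tree tasks; a minimal sketch, assuming sbt 1.4+ (the module name oci-java-sdk-common is my guess at the relevant transitive dependency):

// project/plugins.sbt — enables sbt's built-in dependency tree tasks (sbt 1.4+)
addDependencyTreePlugin

// Then, from the shell:
//   sbt dependencyTree
//   sbt "whatDependsOn com.oracle.oci.sdk oci-java-sdk-common"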
My code is simple, as below:
package com.my_project.test

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}

object MyTest {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val spark = SparkSession
      .builder()
      .config("spark.sql.files.maxPartitionBytes", "4120000000")
      .config("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
      .config("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
      .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
      .config("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
      .config("spark.local.dir", "/u01/spark_temp")
      .config("spark.hadoop.fs.oci.client.auth.tenantId", "<tenantId>")
      .config("spark.hadoop.fs.oci.client.auth.userId", "<userId>")
      .config("spark.hadoop.fs.oci.client.auth.fingerprint", "<fingerprint>")
      .config("spark.hadoop.fs.oci.client.auth.pemfilepath", "</path/to/file_pem>")
      .config("spark.hadoop.fs.oci.client.region", "<region>")
      .master("local[20]")
      .getOrCreate()

    val data = spark.read.parquet("oci://my_bucket@<namespace>/my_bucket/input_data")
    data.write.mode(SaveMode.Overwrite).parquet("oci://my_bucket@<namespace>/output_data")
  }
}
When I run this, I get the following warning:
WARN JerseyHttpRequest: Stream size to upload is 0 bytes, this could potentially lead to data corruption. If this is not intended, please make sure all the OCI SDK dependencies point to the same version
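The second half of the message suggests mismatched OCI SDK versions on the classpath. If that were the cause here, would pinning every transitively resolved oci-java-sdk module to one version be the intended fix? A rough build.sbt sketch (the module names and the version placeholder are illustrative, not verified against connector 3.3.4.1.0.0):

// Hypothetical: force all transitive OCI Java SDK modules to a single version.
// Replace <oci-sdk-version> with the version the connector actually ships against.
dependencyOverrides ++= Seq(
  "com.oracle.oci.sdk" % "oci-java-sdk-common" % "<oci-sdk-version>",
  "com.oracle.oci.sdk" % "oci-java-sdk-objectstorage" % "<oci-sdk-version>"
)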
I'm just reading and writing parquet files normally, not streaming any data, so the zero-byte upload is unexpected. In which cases does this warning disappear?
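And if the warning turns out to be harmless for a plain parquet read/write, is it reasonable to silence it at the logger level? A minimal sketch using the same log4j API as in the code above (the logger name com.oracle.bmc is my assumption about the SDK package emitting the JerseyHttpRequest logs):

// Hypothetical: raise the OCI SDK logger threshold so this WARN is suppressed.
Logger.getLogger("com.oracle.bmc").setLevel(Level.ERROR)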
Thanks in advance for any suggestions!