
Unclear WARNING log when writing Parquet with Spark in Scala

Open lutuantai95 opened this issue 11 months ago • 0 comments

Hi, team.

I'm using the oci-hdfs-connector with Spark in an sbt project, configured as below:

name := "ScalaSparkProject"

version := "0.1"

scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "com.oracle.oci.sdk" % "oci-hdfs-connector" % "3.3.4.1.0.0",
  "org.apache.spark" %% "spark-core" % "3.4.0",
  "org.apache.spark" %% "spark-sql" % "3.4.0"
)

javaOptions ++= Seq(
  "-XX:+IgnoreUnrecognizedVMOptions",
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED",
  "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED",
  "--add-opens=java.base/java.io=ALL-UNNAMED",
  "--add-opens=java.base/java.net=ALL-UNNAMED",
  "--add-opens=java.base/java.nio=ALL-UNNAMED",
  "--add-opens=java.base/java.util=ALL-UNNAMED",
  "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED",
  "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED",
  "--add-opens=java.base/sun.security.action=ALL-UNNAMED",
  "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED",
  "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
)

ThisBuild / fork := true
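
To check which OCI SDK versions this build actually resolves, sbt 1.4+ has a built-in dependency-tree task (enable it with addDependencyTreePlugin in project/plugins.sbt, then run sbt dependencyTree). Since the warning shown below hints at mismatched OCI SDK dependencies, here is a hedged sketch for pinning them to one version; the artifact list and "3.x.y" are placeholders, not what the connector really pulls in:

// build.sbt: hypothetical override pinning the transitive OCI Java SDK
// artifacts to a single version so the connector and the SDK HTTP client
// agree. Replace "3.x.y" with the version read off `sbt dependencyTree`.
dependencyOverrides ++= Seq(
  "com.oracle.oci.sdk" % "oci-java-sdk-common" % "3.x.y",
  "com.oracle.oci.sdk" % "oci-java-sdk-objectstorage" % "3.x.y"
)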

My code is simple, as shown below:

package com.my_project.test

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}

object MyTest {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    val spark = SparkSession
      .builder()
      .config("spark.sql.files.maxPartitionBytes", "4120000000")
      .config("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
      .config("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
      .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
      .config("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
      .config("spark.local.dir", "/u01/spark_temp")
      .config("spark.hadoop.fs.oci.client.auth.tenantId", "<tenantId>"
      .config("spark.hadoop.fs.oci.client.auth.userId", "<userId>")
      .config("spark.hadoop.fs.oci.client.auth.fingerprint", "<fingerprint>")
      .config("spark.hadoop.fs.oci.client.auth.pemfilepath", "</path/to/file_pem>")
      .config("spark.hadoop.fs.oci.client.region", "<region>")
      .master("local[20]")
      .getOrCreate()

    val data = spark.read.parquet("oci://my_bucket@<namespace>/my_bucket/input_data")
    
    data.write.mode(SaveMode.Overwrite).parquet("oci://my_bucket@<namespace>/output_data")
  }
}
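
For context, the only zero-byte object I'd expect from this job is Spark's _SUCCESS marker, which is empty by design. A diagnostic sketch (hypothetical, appended after the write) to list each output object and its size through the Hadoop FileSystem API:

import org.apache.hadoop.fs.Path

// List every object under the output prefix with its size, to check
// whether the 0-byte upload is just the empty _SUCCESS marker or an
// actually truncated part file.
val outPath = new Path("oci://my_bucket@<namespace>/output_data")
val fs = outPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(outPath).foreach { status =>
  println(f"${status.getLen}%12d  ${status.getPath.getName}")
}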

I caught this warning:

WARN JerseyHttpRequest: Stream size to upload is 0 bytes, this could potentially lead to data corruption. If this is not intended, please make sure all the OCI SDK dependencies point to the same version

I'm just reading and writing Parquet normally, not streaming data. In which cases does this warning disappear?
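
And if the warning turns out to be benign for plain Parquet writes, would suppressing it be acceptable? For example, with the same log4j pattern used above; the logger name here is my guess at the package holding JerseyHttpRequest in oci-java-sdk 3.x, so treat it as an assumption:

// Hypothetical workaround, not a fix: quiet this one logger while the
// root cause is investigated. The logger name is an assumption.
Logger.getLogger("com.oracle.bmc.http.client.jersey").setLevel(Level.ERROR)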

Thanks in advance for any suggestions!

lutuantai95 · Dec 04 '24