
Consume TensorFlowInferSchema from Java Spark 2.4.0

Open yeshbash opened this issue 4 years ago • 2 comments

This is not an issue, more a request for guidance. I'd like to use just the TensorFlowInferSchema functionality in my Java Spark job. Below is a sample snippet:

JavaRDD<Example> exampleRdd =
          jsc.parallelize(
              Arrays.asList(
                  Example.newBuilder()
                      .setFeatures(
                          Features.newBuilder()
                              .putFeature(
                                  "Test",
                                  Feature.newBuilder()
                                      .setInt64List(Int64List.newBuilder().addValue(10).build())
                                      .build())
                              .build())
                      .build()));
StructType schema = TensorFlowInferSchema.apply(exampleRdd.rdd(), ???);

However, the apply method expects a second, implicit argument: evidence$1 : scala.reflect.runtime.universe.TypeTag[T].

Below is the compiled signature:

def apply[T](rdd : org.apache.spark.rdd.RDD[T])(implicit evidence$1 : scala.reflect.runtime.universe.TypeTag[T]) : org.apache.spark.sql.types.StructType = { /* compiled code */ }

I'm not familiar with Scala, and it appears that it's not possible to create a TypeTag[Example] in Java.

I'd appreciate it if you could share your thoughts on the following:

  • Is it safe (and possible) to just consume TensorFlowInferSchema in Java without going via spark.read.format("tfrecord").option("recordType", "Example")?
  • What to pass for the TypeTag[T] argument?
  • Is there a Java example for this use case?
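(For context, one common workaround for the TypeTag question is a tiny Scala shim that fixes the element type to Example, so the Scala compiler materializes the implicit TypeTag at the shim's call site and the Java caller never has to construct one. This is an untested sketch: the object name SchemaHelper and the package path of TensorFlowInferSchema are assumptions, not something confirmed in this thread.)

object SchemaHelper {
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.types.StructType
  import org.tensorflow.example.Example
  // Package path assumed from spark-tfrecord's source layout; adjust if needed.
  import com.linkedin.spark.datasources.tfrecord.TensorFlowInferSchema

  // T is fixed to Example here, so the compiler supplies TypeTag[Example]
  // implicitly; Java code can call this like any static method.
  def inferExampleSchema(rdd: RDD[Example]): StructType =
    TensorFlowInferSchema(rdd)
}

From Java, the call would then be: StructType schema = SchemaHelper.inferExampleSchema(exampleRdd.rdd());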

Thanks for maintaining this project and highly appreciate your help!

yeshbash avatar Sep 04 '21 03:09 yeshbash

"TensorFlowInferSchema" only infers the schema. Is that what you need? I am not familiar with the Spark Java API, but can you call a Scala function from Java? If you want to read/write TFRecord with the Java API, I assume something like this will work:

Dataset<Row> usersDF = spark.read().format("tfrecord").load("examples/src/main/resources/users.tfrecord");
usersDF.select("name", "favorite_color").write().format("tfrecord").save("namesAndFavColors.tfrecord");

junshi15 avatar Sep 04 '21 07:09 junshi15

Thanks for the example on loading - yes, that should help read tfrecord files into a DataFrame. However, as you point out, I just need automatic schema inference, since we already build a JavaRDD<Example>. My understanding is that Java classes can consume Scala classes and vice versa, but I'm not familiar enough with Scala to tell whether TensorFlowInferSchema can be invoked that way. I was hoping to get some help there.

Thanks!

yeshbash avatar Sep 04 '21 14:09 yeshbash