spark-tfrecord

DataFrame cannot be converted to TFRecord

Open Brian1203-zz opened this issue 4 years ago • 3 comments

val df1: DataFrame = spark.createDataFrame(rdd,subSchema)
val df2 = df1.withColumn("entity",struct("age","salary")).
  groupBy("employee_name")

df1 can be converted to TFRecord, but df2 cannot.

Why is that?

Brian1203-zz avatar Dec 23 '21 03:12 Brian1203-zz

df2 is the output of the groupBy operation, which is a "RelationalGroupedDataset", not a DataFrame. There is no TFRecord equivalent for this kind of data schema. TFRecord supports only a very limited set of schemas. It is for TF model training only.
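
A minimal sketch of the point above, assuming a local SparkSession and made-up sample rows (the column names come from the question; the values and the `ages`/`salaries` aliases are hypothetical). It shows that `groupBy` alone yields a `RelationalGroupedDataset`, while following it with `agg` yields a DataFrame again. Whether spark-tfrecord accepts the resulting column types is not confirmed here and should be checked against its supported-type list.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder()
  .master("local[*]")              // assumption: local test session
  .appName("groupby-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with the columns used in the question.
val df1: DataFrame = Seq(
  ("Jen", 53, 79000),
  ("Jen", 41, 65000),
  ("Maria", 24, 90000)
).toDF("employee_name", "age", "salary")

// groupBy on its own is NOT a DataFrame, so it has no write method.
val grouped: RelationalGroupedDataset =
  df1.withColumn("entity", struct("age", "salary")).groupBy("employee_name")

// An aggregation turns the grouped data back into a DataFrame
// with an array-of-struct column named collect_list(entity).
val df2: DataFrame = grouped.agg(collect_list("entity"))

// Alternative with arrays of primitives instead of an array of structs.
val flat: DataFrame = df1.groupBy("employee_name")
  .agg(collect_list("age").as("ages"), collect_list("salary").as("salaries"))

df2.printSchema()
flat.printSchema()
```

The second variant keeps each collected field in its own array column, which may be a better fit if nested structs turn out to be unsupported by the writer.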

junshi15 avatar Dec 23 '21 03:12 junshi15

May I ask how I can design my schema like this? I want to convert an RDD into a DataFrame using StructType, instead of the groupBy operation.

root
 |-- name: string (nullable = true)
 |-- age: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: string (nullable = true)
 |    |    |-- _3: integer (nullable = false)

+-------------+--------------------+
|employee_name|collect_list(entity)|
+-------------+--------------------+
|          Jen|       [[53, 79000]]|
|      Michael|[[56, 86000], [30...|
|        Kumar|[[34, 90000], [50...|
|        Maria|       [[24, 90000]]|
|        Raman|[[40, 99000], [36...|
|         Jeff|       [[25, 80000]]|
+-------------+--------------------+

Brian1203-zz avatar Dec 23 '21 07:12 Brian1203-zz

If you already have the RDD, you can define the schema and then call createDataFrame, as shown here.

https://stackoverflow.com/questions/29383578/how-to-convert-rdd-object-to-dataframe-in-spark
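
As a rough illustration of that approach, the sketch below builds a `StructType` mirroring the schema printed earlier in the thread and passes it to `createDataFrame` together with an `RDD[Row]`. The row values and the local session settings are hypothetical, and this only shows how to construct the DataFrame; it does not establish that spark-tfrecord can serialize an array-of-struct column.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]")              // assumption: local test session
  .appName("nested-schema-sketch")
  .getOrCreate()

// Element type of the array: a struct with fields _1, _2, _3.
val elementType = StructType(Seq(
  StructField("_1", StringType, nullable = true),
  StructField("_2", StringType, nullable = true),
  StructField("_3", IntegerType, nullable = false)
))

// Top-level schema: name plus an array of the struct above.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", ArrayType(elementType, containsNull = true), nullable = true)
))

// Hypothetical rows; nested structs are given as Row, arrays as Seq.
val rdd = spark.sparkContext.parallelize(Seq(
  Row("Jen", Seq(Row("a", "b", 1))),
  Row("Jeff", Seq(Row("c", "d", 2), Row("e", "f", 3)))
))

val df = spark.createDataFrame(rdd, schema)
df.printSchema()
df.show(truncate = false)
```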

junshi15 avatar Dec 23 '21 14:12 junshi15