tfrecord write results in no data but no error

Open • dennisobrien opened this issue 3 years ago • 2 comments

Hi -- I am trying to use spark-tfrecord with Spark 3.1.2, but the files written have no data.

  • Spark 3.1.2
  • Python 3.8.10
  • Java 1.8.0
  • Scala 2.12.10

I'm using the latest version available from the Maven repository:

<dependency>
    <groupId>com.linkedin.sparktfrecord</groupId>
    <artifactId>spark-tfrecord_2.12</artifactId>
    <version>0.3.4</version>
</dependency>

Following the pyspark example from the README but simplified further:

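from pyspark.sql.types import (
    FloatType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# `spark` is the SparkSession provided by the pyspark shell.
path = "/tmp/test-output.tfrecord"

# Three columns, one per primitive type under test.
fields = [
    StructField("a", IntegerType()),
    StructField("b", FloatType()),
    StructField("c", StringType()),
]
schema = StructType(fields)
test_rows = [
    [1, 0.5, 'x'],
    [2, 1.5, 'y'],
    [3, 2.5, 'z'],
]
rdd = spark.sparkContext.parallelize(test_rows)
df = spark.createDataFrame(rdd, schema)
df.show()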

Outputs:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|0.5|  x|
|  2|1.5|  y|
|  3|2.5|  z|
+---+---+---+

Saving the spark dataframe to tfrecord does not throw an error.

path = "/tmp/test-output.tfrecord/"
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(path)

But the directory only has a _SUCCESS flag and a crc file, no data.

ls -la /tmp/test-output.tfrecord/
total 12
drwxr-xr-x.  2 build build 4096 Feb 19 19:00 .
drwxrwxrwx. 11 root  root  4096 Feb 19 19:00 ..
-rw-r--r--.  1 build build    0 Feb 19 19:00 _SUCCESS
-rw-r--r--.  1 build build    8 Feb 19 19:00 ._SUCCESS.crc
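
The same check can be done from Python. This is only a minimal sketch: it assumes the output is on the local filesystem and relies on Spark's convention of naming data files with a `part-` prefix. `list_part_files` is an illustrative helper name, not part of spark-tfrecord.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the non-empty Spark data files (part-*) directly under output_dir."""
    root = Path(output_dir)
    return sorted(
        p for p in root.glob("part-*")
        if p.is_file() and p.stat().st_size > 0
    )


if __name__ == "__main__":
    # Path taken from the example above; adjust as needed.
    parts = list_part_files("/tmp/test-output.tfrecord")
    print(f"{len(parts)} part file(s) found")
    for p in parts:
        print(p.name, p.stat().st_size, "bytes")
```

Marker files such as `_SUCCESS` and hidden `.crc` checksum files do not match the `part-*` pattern, so an empty result corresponds to the directory listing shown above.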

And of course, trying to read the file fails.

spark.read.format('tfrecord').option('recordType', 'Example').load(path).show()

Error:

AnalysisException: Unable to infer schema for TFRECORD. It must be specified manually.

Let me know if there is more system/config information that could help to debug this.

FWIW, I hit exactly the same behavior when testing spark-tensorflow-connector, which I built from source. I assumed something was wrong with my dependencies and decided to try this project instead.

thanks, Dennis

dennisobrien, Feb 20 '22 03:02

I am running into the same problem: the write produces no data but raises no error, and the write path contains only the _SUCCESS and ._SUCCESS.crc files. Everything works as expected on a GPU instance, but no data is written on a CPU instance.

Here are my details:

  • Spark 3.5.0
  • Java Zulu 8.78.0.19-CA-linux64
  • Python 3.11.0
  • Scala 2.12.18
  • tensorflow 2.16.1

kpfoley, Aug 12 '24 15:08

@kpfoley I tried the code above with Spark 3.5.0 and spark-tfrecord_2.12:0.7.0. It worked fine on my MacBook Pro (part files were generated).

pyspark --packages com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.7.0
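
For a standalone script rather than the interactive shell, the same package can presumably be requested through Spark's `spark.jars.packages` setting. A minimal sketch follows; `maven_coordinate` and `build_session` are illustrative helper names, not part of spark-tfrecord, and the setting is only expected to take effect if no Spark session is already running in the process.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Join the three Maven fields into the group:artifact:version form."""
    return f"{group_id}:{artifact_id}:{version}"


def build_session(coordinate, app_name="tfrecord-test"):
    """Create a SparkSession that asks Spark to fetch the given package.

    pyspark is imported lazily so the coordinate helper above can be used
    without Spark installed.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # hypothetical application name
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


# Same coordinate as in the pyspark --packages command above.
COORD = maven_coordinate(
    "com.linkedin.sparktfrecord", "spark-tfrecord_2.12", "0.7.0"
)
```

Calling `spark = build_session(COORD)` would then give a session on which the write example from the original report can be rerun.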

junshi15, Aug 13 '24 04:08