spark-tfrecord
tfrecord write results in no data but no error
Hi -- I am trying to use spark-tfrecord
with Spark 3.1.2, but the files written have no data.
- Spark 3.1.2
- Python 3.8.10
- Java 1.8.0
- Scala 2.12.10
I'm using the latest version available from the Maven repository:
<dependency>
  <groupId>com.linkedin.sparktfrecord</groupId>
  <artifactId>spark-tfrecord_2.12</artifactId>
  <version>0.3.4</version>
</dependency>
Following the pyspark example from the README but simplified further:
# `spark` is the SparkSession provided by the pyspark shell
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

path = "/tmp/test-output.tfrecord"
fields = [
    StructField("a", IntegerType()),
    StructField("b", FloatType()),
    StructField("c", StringType()),
]
schema = StructType(fields)
test_rows = [
    [1, 0.5, 'x'],
    [2, 1.5, 'y'],
    [3, 2.5, 'z'],
]
rdd = spark.sparkContext.parallelize(test_rows)
df = spark.createDataFrame(rdd, schema)
df.show()
Outputs:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|0.5|  x|
|  2|1.5|  y|
|  3|2.5|  z|
+---+---+---+
Saving the spark dataframe to tfrecord does not throw an error.
path = "/tmp/test-output.tfrecord/"
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(path)
But the directory only has a _SUCCESS flag and a crc file, no data.
ls -la /tmp/test-output.tfrecord/
total 12
drwxr-xr-x. 2 build build 4096 Feb 19 19:00 .
drwxrwxrwx. 11 root root 4096 Feb 19 19:00 ..
-rw-r--r--. 1 build build 0 Feb 19 19:00 _SUCCESS
-rw-r--r--. 1 build build 8 Feb 19 19:00 ._SUCCESS.crc
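The same check can be scripted, which helps when comparing several runs or machines. This is a minimal sketch in plain Python, assuming the output is on the local filesystem and that Spark gives its data files the usual `part-` prefix. `list_part_files` is a hypothetical helper name, not part of spark-tfrecord.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return sorted (name, size_in_bytes) pairs for Spark data files.

    Assumption: data files start with "part-". Marker files such as
    _SUCCESS and hidden .crc checksum files are ignored.
    """
    out = Path(output_dir)
    if not out.is_dir():
        # Nothing was written at all (or the path is wrong).
        return []
    return sorted(
        (p.name, p.stat().st_size)
        for p in out.iterdir()
        if p.is_file() and p.name.startswith("part-")
    )
```

An empty list for the directory above would confirm that the write committed without producing any part files. A non-empty list with zero sizes would point to a different problem (files created but no records written).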
And of course, trying to read the file fails.
spark.read.format('tfrecord').option('recordType', 'Example').load(path).show()
Error:
AnalysisException: Unable to infer schema for TFRECORD. It must be specified manually.
Let me know if there is more system/config information that could help to debug this.
FWIW, I saw exactly the same behavior when testing spark-tensorflow-connector, which I was building from source. I assumed something was wrong with my dependencies, so I decided to try this project instead.
thanks, Dennis
I am running into the same problem: the write produces no data but raises no error. The write path contains only the _SUCCESS and ._SUCCESS.crc files. Everything works as expected on a GPU instance, but no data is written on a CPU instance.
Here are my details:
- Spark: 3.5.0
- Java: Zulu 8.78.0.19-CA-linux64
- Python: 3.11.0
- Scala: 2.12.18
- tensorflow: 2.16.1
@kpfoley I tried the code above with Spark 3.5.0 and spark-tfrecord_2.12:0.7.0. It worked fine on my MacBook Pro (part files were generated). I started pyspark with:
pyspark --packages com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.7.0
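To confirm that generated part files actually contain records without going through Spark or TensorFlow, the TFRecord framing can be walked directly. This is a rough sketch under stated assumptions: it relies on the standard TFRecord layout (little-endian uint64 length, 4-byte length CRC, payload, 4-byte payload CRC), it expects an uncompressed file, and it skips the CRCs rather than verifying them. `count_tfrecords` is a hypothetical helper name.

```python
import struct


def count_tfrecords(path):
    """Count records in an uncompressed TFRecord file.

    Walks the length headers only; the CRC fields are skipped,
    not validated, so this detects truncation but not corruption.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            # Skip length CRC (4 bytes), payload, and payload CRC (4 bytes).
            rest = f.read(4 + length + 4)
            if len(rest) != length + 8:
                raise ValueError("truncated record")
            count += 1
    return count
```

For the three-row DataFrame in the original report, the counts across all part files in the output directory should sum to 3.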