TFRecords file is too big! 10x the size of Parquet
See similar GitHub issues here: https://github.com/tensorflow/ecosystem/issues/61#issuecomment-363577011 https://github.com/tensorflow/ecosystem/issues/61 https://github.com/tensorflow/ecosystem/issues/106
This is how I'm writing a PySpark DataFrame as TFRecords to an S3 bucket:
s3_path = "s3://Shuks/dataframe_tf_records"
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(s3_path)
This creates a new key ("directory") on S3 with the path s3://Shuks/dataframe_tf_records/, and all the TFRecord part files are written under it.
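To put a number on the "10x the size of Parquet" claim, one way is to total the object sizes under each output prefix. A rough sketch using boto3; the bucket and prefix below are taken from the s3_path above and are only placeholders to adjust:

import boto3

# Sum the sizes of all objects under the TFRecord output prefix.
# Bucket and prefix are assumptions based on the s3_path used above.
s3 = boto3.client("s3")
total_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="Shuks", Prefix="dataframe_tf_records/"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
print(total_bytes / (1024 ** 2), "MiB")

Running the same count against the Parquet output prefix gives the two sizes being compared.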
How do I specify compression type during conversion?
Try this:
option("codec", "org.apache.hadoop.io.compress.GzipCodec")
I used this method: data.repartition(50).write.mode("overwrite").format("tfrecords").option("codec", "org.apache.hadoop.io.compress.GzipCodec").save(path), but the output files do not seem to get any smaller; the option did not take effect.