
TFRecords File is too big! 10X the size of parquet

Open kart2k15 opened this issue 3 years ago • 2 comments

See similar GitHub issues: https://github.com/tensorflow/ecosystem/issues/61#issuecomment-363577011 https://github.com/tensorflow/ecosystem/issues/61 https://github.com/tensorflow/ecosystem/issues/106
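For context (my addition, not from the thread): part of the size gap is structural. Parquet is columnar and compressed by default, while TFRecord is a row-oriented container that is uncompressed unless a codec is set, and it adds fixed framing bytes per record: an 8-byte length, a 4-byte length CRC, and a 4-byte payload CRC. A minimal sketch of that framing overhead (`tfrecord_file_size` is a hypothetical helper, not part of any library):

```python
# Hypothetical sketch: approximate the on-disk size of an UNCOMPRESSED
# TFRecord file. Each record is framed as:
#   8-byte length + 4-byte length CRC + payload + 4-byte payload CRC
# i.e. 16 bytes of framing per record, before any protobuf overhead.

def tfrecord_file_size(payload_sizes):
    """Approximate uncompressed TFRecord file size for the given payloads."""
    FRAMING_BYTES = 8 + 4 + 4  # length, length CRC, payload CRC
    return sum(size + FRAMING_BYTES for size in payload_sizes)

# One million tiny 20-byte records: framing alone adds 80% on top
# of the payload bytes, before counting tf.Example protobuf overhead.
records = [20] * 1_000_000
total = tfrecord_file_size(records)     # 36,000,000 bytes
payload = sum(records)                  # 20,000,000 bytes
print(total, (total - payload) / payload)
```

This framing plus per-record `tf.Example` protobuf overhead (feature names are repeated in every record) is why small-record datasets blow up relative to Parquet even before compression is considered.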

This is how I'm writing a PySpark DataFrame as TFRecords to an S3 bucket:

s3_path = "s3://Shuks/dataframe_tf_records"   
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(s3_path)

This creates a new key ("directory") on S3 at s3://Shuks/dataframe_tf_records/, and all the TFRecord part files are written under it.

How do I specify compression type during conversion?

kart2k15 avatar Mar 09 '22 18:03 kart2k15

try this: option("codec","org.apache.hadoop.io.compress.GzipCodec")

junshi15 avatar Apr 08 '22 05:04 junshi15

try this: option("codec","org.apache.hadoop.io.compress.GzipCodec") I used this method: data.repartition(50).write.mode("overwrite").format('tfrecords').option("codec", "org.apache.hadoop.io.compress.GzipCodec").save(path), but the files do not seem any smaller. The option did not take effect.
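Two things worth checking here (my suggestions, not from the thread). First, the earlier snippets in this issue use format("tfrecord") (singular), while this one uses format('tfrecords'); if the short name doesn't match the data source this library registers, the "codec" option may be silently ignored by whatever source does handle it, so it is worth confirming the format string against the project's README. Second, when the codec does take effect, the part files should carry a .gz suffix and repetitive feature data should shrink noticeably. A quick local sanity check of that second expectation, using Python's standard gzip module on synthetic repetitive record bytes:

```python
# Hypothetical sketch: gzip on repetitive serialized records (typical of
# feature data with repeated keys) should compress dramatically. If the
# Spark output files are the same size as before, the codec option most
# likely never reached the writer.
import gzip

# Synthetic "records": repeated feature-like strings, as tf.Example
# payloads tend to be (feature names repeat in every record).
raw = b"".join(b"feature:%08d;" % (i % 100) for i in range(10_000))
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
assert len(compressed) < len(raw)  # repetitive data compresses well
```

If a local check like this compresses well but the S3 part files don't shrink and don't end in .gz, the option is not reaching the underlying writer, which points at the format string or option name rather than at gzip itself.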

sosixyz avatar May 22 '24 06:05 sosixyz