scio
Don't load default Configuration resources
By default, calling new Configuration() loads from any available core-site.xml and core-default.xml. Scio contains a minimalistic core-site.xml implementation, but core-default.xml as loaded from Hadoop is quite long: https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-common/core-default.xml

Since the Configuration is part of the serialized job graph for Parquet reads/writes, we should reconsider whether we actually need the contents of core-default.xml. It contributes quite a lot to the serialized byte size:
// Returns (number of entries, serialized size in bytes) for a Configuration
def getLengthAndByteSize(conf: Configuration): (Int, Int) = {
  val baos = new ByteArrayOutputStream()
  val oeos = new ObjectEncoderOutputStream(baos)
  new SerializableConfiguration(conf).writeExternal(oeos)
  (conf.iterator().asScala.toList.size, baos.size())
}

// Test default Configuration
getLengthAndByteSize(new Configuration())
// (258, 22880): 258 entries and a byte size of 22.88 KB

// Test Configuration without core-default.xml loaded
val nonDefaultConfiguration = new Configuration(false)
nonDefaultConfiguration.addResource("core-site.xml") // Still need Scio's core-site.xml
getLengthAndByteSize(nonDefaultConfiguration)
// (4, 377): 4 entries and a byte size of 377 bytes

Another option is to apply some kind of compression to the encoded SerializableConfiguration in Beam, maybe gzip?
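
To get a rough feel for the gzip idea, here is a minimal sketch that uses only the JDK. It does not touch Beam's SerializableConfiguration or Hadoop's Configuration: the GzipConfSketch object and its Entries type are hypothetical stand-ins that write plain key/value pairs, once directly and once through a GZIPOutputStream, so the two byte sizes can be compared. How much a real Configuration would shrink is not something this sketch can tell us.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper, not part of Scio, Beam, or Hadoop.
object GzipConfSketch {
  // Stand-in for the key/value entries of a Hadoop Configuration.
  type Entries = Map[String, String]

  // Writes the entry count followed by each key and value.
  private def writeEntries(entries: Entries, out: DataOutputStream): Unit = {
    out.writeInt(entries.size)
    entries.foreach { case (k, v) =>
      out.writeUTF(k)
      out.writeUTF(v)
    }
  }

  // Plain (uncompressed) encoding.
  def serialize(entries: Entries): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new DataOutputStream(baos)
    writeEntries(entries, out)
    out.close()
    baos.toByteArray
  }

  // Same encoding, wrapped in gzip.
  def serializeGzip(entries: Entries): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new DataOutputStream(new GZIPOutputStream(baos))
    writeEntries(entries, out)
    out.close() // closing finishes the gzip stream and writes its trailer
    baos.toByteArray
  }

  // Reads back what serializeGzip wrote.
  def deserializeGzip(bytes: Array[Byte]): Entries = {
    val in = new DataInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes)))
    try {
      val n = in.readInt()
      (0 until n).map { _ =>
        val k = in.readUTF()
        val v = in.readUTF()
        k -> v
      }.toMap
    } finally in.close()
  }
}
```

Compression adds CPU cost on both the serializing and deserializing side, and for a very small Configuration (like the 4-entry case above) the gzip header and trailer may outweigh any savings, so skipping core-default.xml and compressing are complementary rather than interchangeable.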