Don't load default Configuration resources

Open clairemcginty opened this issue 1 year ago • 0 comments

By default, calling new Configuration() loads resources from any core-default.xml and core-site.xml available on the classpath. Scio ships a minimal core-site.xml, but the core-default.xml bundled with Hadoop is quite long: https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-common/core-default.xml

Since the Configuration is part of the serialized job graph for Parquet reads/writes, we should reconsider whether we actually need the contents of core-default.xml. It contributes quite a lot to the serialized byte size:

def getLengthAndByteSize(conf: Configuration): (Int, Int) = {
  val baos = new ByteArrayOutputStream()
  val oeos = new ObjectEncoderOutputStream(baos)
  new SerializableConfiguration(conf).writeExternal(oeos)

  (conf.iterator().asScala.toList.size, baos.size())
}

// Test default Configuration
getLengthAndByteSize(new Configuration())
(258, 22880) // 258 entries and a byte size of 22.88 KB

// Test Configuration without core-default.xml loaded
// (loadDefaults = false skips both core-default.xml and core-site.xml)
val nonDefaultConfiguration = new Configuration(false)
nonDefaultConfiguration.addResource("core-site.xml") // Still need Scio's core-site.xml

getLengthAndByteSize(nonDefaultConfiguration)
(4, 377) // 4 entries and a byte size of 377 bytes

Another option is to apply some kind of compressed encoding to the SerializableConfiguration in Beam, maybe gzip?
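A rough sketch of what the gzip idea could look like, using only the JDK. This is not Beam's or Scio's actual serialization: `GzipConfSketch`, `compress` and `decompress` are hypothetical names, and the configuration is modeled as a plain `Map[String, String]` standing in for the entries returned by `conf.iterator()`.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper, not part of Beam or Scio.
object GzipConfSketch {

  // Write the entry count followed by each key/value pair through a gzip stream.
  // Note: writeUTF limits each string to 65535 encoded bytes.
  def compress(entries: Map[String, String]): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new DataOutputStream(new GZIPOutputStream(baos))
    try {
      out.writeInt(entries.size)
      entries.foreach { case (key, value) =>
        out.writeUTF(key)
        out.writeUTF(value)
      }
    } finally {
      out.close() // closing the stream writes the gzip trailer
    }
    baos.toByteArray
  }

  // Reverse of compress: read the count, then that many key/value pairs.
  def decompress(bytes: Array[Byte]): Map[String, String] = {
    val in = new DataInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes)))
    try {
      val count = in.readInt()
      (0 until count).map { _ =>
        val key = in.readUTF()
        val value = in.readUTF()
        key -> value
      }.toMap
    } finally {
      in.close()
    }
  }
}
```

How much this saves on a real default Configuration would need to be measured with getLengthAndByteSize above. The property names in core-default.xml share many prefixes, so they may compress well, but that is an expectation, not a measured result.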

clairemcginty, Apr 17 '23 20:04