Rework Hadoop Configuration defaults in Scio

Open clairemcginty opened this issue 1 year ago • 0 comments

Right now Scio overrides core-site.xml in a few modules (scio-parquet, scio-smb). This gets picked up by library consumers. If they wish, they can specify their own file at src/main/resources/core-site.xml, which will supersede Scio's settings entirely: Hadoop will choose the user's settings file and ignore Scio's, even if the two define completely different property keys.
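To make the "supersedes entirely" behavior concrete, here is a minimal sketch for checking which core-site.xml wins on a given classpath. It assumes Scala 2.13 (for scala.jdk.CollectionConverters) and relies on my understanding that Hadoop resolves a named resource with a single classloader lookup, so only the first copy on the classpath is read. `CoreSiteLookup` and its methods are hypothetical helper names, not part of Scio or Hadoop.

```scala
import java.net.URL
import scala.jdk.CollectionConverters._

// Hypothetical diagnostic helper; not part of Scio or Hadoop.
object CoreSiteLookup {
  private def defaultLoader: ClassLoader =
    Thread.currentThread().getContextClassLoader

  // Every copy of the named resource visible on the classpath, in lookup order.
  def allCopies(name: String, cl: ClassLoader = defaultLoader): List[URL] =
    cl.getResources(name).asScala.toList

  // The single copy that a name-based lookup resolves to, if any.
  def winner(name: String, cl: ClassLoader = defaultLoader): Option[URL] =
    Option(cl.getResource(name))

  def main(args: Array[String]): Unit = {
    val name = "core-site.xml"
    println(s"Resolved: ${winner(name).getOrElse("none found")}")
    // Any additional copies listed here are shadowed and never read.
    allCopies(name).foreach(url => println(s"On classpath: $url"))
  }
}
```

If this prints both a copy from the user's project and one from a Scio jar, only the resolved one contributes settings; the properties in the other are silently dropped rather than merged.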

This is good in that it's easy to use, but bad in that users might not realize that defining their own settings file throws away Scio's reasonable default settings (e.g. block size, fadvise).

The way Hadoop handles this is that, for each module, the library defines a {module}-default.xml file and the library user defines a {module}-site.xml file. This pattern is repeated in a few places (core-default.xml vs core-site.xml, for example, or mapred-default.xml vs mapred-site.xml).

Maybe in Scio we could switch from overriding core-site.xml to shipping a new file, scio-default.xml, and letting users supply their own scio-site.xml to override it. We'd have to add a call to Configuration#addDefaultResource, for both scio-default.xml and scio-site.xml, in both the scio-parquet and scio-smb modules.
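A rough sketch of what that registration could look like, assuming hadoop-common is on the classpath. `ScioHadoopDefaults` and `newConfiguration` are hypothetical names used only for illustration, and the two file names are the ones proposed above; neither file exists in Scio today.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper; a sketch of the proposal, not existing Scio code.
object ScioHadoopDefaults {
  // File names taken from the proposal in this issue.
  val DefaultResource = "scio-default.xml" // would ship inside the Scio jar
  val SiteResource = "scio-site.xml"       // optional, supplied by the user

  // addDefaultResource is static and process-wide. Resources registered
  // later override earlier ones, so the site file is registered last.
  Configuration.addDefaultResource(DefaultResource)
  Configuration.addDefaultResource(SiteResource)

  // Accessing this method forces the object initializer above to run
  // before the Configuration is constructed.
  def newConfiguration(): Configuration = new Configuration()
}
```

With this in place, scio-parquet and scio-smb would call `ScioHadoopDefaults.newConfiguration()` instead of `new Configuration()`. A user's scio-site.xml would then only need to list the keys it changes, and a user-provided core-site.xml would no longer displace Scio's defaults. As far as I know, a default resource that is missing from the classpath is skipped in Hadoop's default quiet mode, so scio-site.xml could stay optional, but that is worth verifying against the Hadoop version in use.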

Downside: more Hadoop library code in Scio. Upside: greater flexibility to roll out configuration options and potentially less confusion over expected behavior.

clairemcginty, Jan 31 '23 15:01