Rework Hadoop Configuration defaults in Scio
Right now Scio overrides core-site.xml in a few modules (scio-parquet, scio-smb). Library consumers pick this up automatically. If they wish, they can provide their own core-site.xml at src/main/resources/core-site.xml, which supersedes Scio's settings entirely: Hadoop loads the user's file and ignores Scio's, even if the two files define completely different property keys.
This is good in that it is easy to use, but bad in that users might not realize that by defining their own settings file they are throwing away Scio's reasonable default settings (e.g. block size, fadvise).
Hadoop handles this by splitting configuration per module: the library defines a {module}-default.xml file and the library user defines a {module}-site.xml file. This pattern is repeated in a few places (core-default.xml vs core-site.xml, for example, or mapred-default.xml vs mapred-site.xml).
Maybe in Scio we could switch from overriding core-site.xml to shipping a new file, scio-default.xml, and letting users put their overrides in a scio-site.xml file. We'd have to add a call to Configuration#addDefaultResource, for both scio-default.xml and scio-site.xml, in both the scio-parquet and scio-smb modules.
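A minimal sketch of what that registration might look like, assuming it lives in a small shared helper. The object name ScioHadoopDefaults and its configuration() method are hypothetical; only Configuration.addDefaultResource and the Configuration constructor are existing Hadoop API.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper; the name and where it would live are assumptions.
object ScioHadoopDefaults {
  // addDefaultResource is static, so these calls affect every Configuration
  // that loads default resources in this JVM, not only the ones created here.
  // Order matters: scio-site.xml is registered last so that the user's values
  // are applied on top of Scio's defaults.
  Configuration.addDefaultResource("scio-default.xml")
  Configuration.addDefaultResource("scio-site.xml")

  // Calling this method forces the object initializer above to run before
  // the Configuration is built.
  def configuration(): Configuration = new Configuration()
}
```

scio-parquet and scio-smb would then obtain their Configuration through this helper instead of calling new Configuration() directly. One thing to verify before committing to this: as far as I understand, default resources are applied in registration order, so both files would be loaded after core-site.xml and values in scio-default.xml could take precedence over a user's core-site.xml.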
Downside: more Hadoop library code in Scio. Upside: greater flexibility to roll out configuration options and potentially less confusion over expected behavior.