
RFC: Nesting Refactor for Encoder/Decoder Config

Open connor-mccorm opened this issue 3 years ago • 7 comments

In the push to create a fully fledged schema using Marshmallow, we've run into a bit of an issue. When specifying things like optimizers, the structure of the config works well in a nested structure like this:

trainer:
  learning_rate: 0.001
  optimizer:
    type: adam
    beta_1: 0.9
    beta_2: 0.999

This works great for the structure of the schema since it allows us to create a hierarchical set of modules and submodules that help build out the schema object, which is used internally for consistency in Ludwig, for populating parameter info on the frontend, and for building the config object in the SDK. The issue lies in the config structure for encoders and decoders. Currently we specify encoder details in the config without any nesting, just adding any encoder-specific parameters at the same level as the feature itself. For example:

input_features:
  - name: Feature_1
    type: binary
    encoder: dense
    dropout: 0.2
    activation: leakyRelu
  - name: Feature_2
    type: text
    encoder: stacked_cnn
    dropout: 0.2
    reduce_output: mean
    embeddings_on_cpu: false
    strides: 5
    output_size: 128

The issue here is that the encoder parameters sit at the input feature level, which is a problem for a few reasons:

  • The code required to validate a schema without a nested module structure becomes very messy and convoluted, and this will only get worse over time as new encoders are added, undercutting our value proposition of extensibility.
  • Specifying encoder parameters in the future SDK config object will be complicated and unintuitive, because you will have to add the encoder-specific parameters to the input feature as opposed to the specific encoder you want to use.

Because of this, we are proposing a refactor of the encoder config structure that nests encoders an additional level to stay consistent with a module structure. For example:

Instead of this

input_features:
  - name: Feature_1
    type: binary
    encoder: dense
    dropout: 0.2
    activation: leakyRelu

We would do this

input_features:
  - name: Feature_1
    type: binary
    encoder:
      type: dense
      dropout: 0.2
      activation: leakyRelu
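To make the mapping concrete, here's a minimal sketch of how the nested layout could map onto a hierarchical schema object (plain dataclasses as stand-ins for the marshmallow schemas; class and field names are illustrative, not Ludwig's actual classes):

```python
from dataclasses import dataclass, field

# Illustrative dataclass stand-ins for hierarchical schemas: the feature
# schema owns an encoder sub-schema, mirroring the YAML nesting above.
@dataclass
class DenseEncoderConfig:
    type: str = "dense"
    dropout: float = 0.0
    activation: str = "relu"

@dataclass
class BinaryInputFeatureConfig:
    name: str = ""
    encoder: DenseEncoderConfig = field(default_factory=DenseEncoderConfig)

feat = BinaryInputFeatureConfig(
    name="Feature_1",
    encoder=DenseEncoderConfig(dropout=0.2, activation="leakyRelu"),
)
print(feat.encoder.dropout)  # 0.2
```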

Note that this applies to decoders as well. If we decide to move forward with this proposal, we would need to put in some logic to handle backwards compatibility for all existing Ludwig users. While this may seem like a headache, I believe this is something we definitely want to do, and doing it now rather than later is a good idea, since the impact of a change like this will only grow over time. Let me know your thoughts!
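For illustration, the backwards-compatibility layer could be a shim along these lines (a minimal sketch; the key sets and function name are hypothetical, not Ludwig's actual upgrade path):

```python
# Hypothetical backwards-compatibility shim: fold flat encoder params in a
# legacy input feature dict into a nested `encoder` dict.
FEATURE_LEVEL_KEYS = {"name", "type", "preprocessing"}  # assumed feature-only keys

def upgrade_feature_config(feature: dict) -> dict:
    """Convert the legacy flat layout to the proposed nested layout."""
    if isinstance(feature.get("encoder"), dict):
        return feature  # already nested, nothing to do

    upgraded = {k: v for k, v in feature.items() if k in FEATURE_LEVEL_KEYS}
    encoder = {k: v for k, v in feature.items() if k not in FEATURE_LEVEL_KEYS}

    # The legacy `encoder: dense` string becomes `encoder: {type: dense}`.
    if "encoder" in encoder:
        encoder["type"] = encoder.pop("encoder")
    if encoder:
        upgraded["encoder"] = encoder
    return upgraded

legacy = {"name": "Feature_1", "type": "binary", "encoder": "dense", "dropout": 0.2}
print(upgrade_feature_config(legacy))
# {'name': 'Feature_1', 'type': 'binary', 'encoder': {'type': 'dense', 'dropout': 0.2}}
```

Running the shim twice is a no-op, so already-upgraded configs pass through untouched.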

connor-mccorm avatar Jun 09 '22 21:06 connor-mccorm

I'm not necessarily opposed to this change, but this would be a rather significant backwards compatibility layer that we'd need to maintain for a long time, so definitely want to verify that this is absolutely something we want to do.

  • Specifying encoder parameters in the future SDK config object will be complicated and unintuitive, because you will have to add the encoder-specific parameters to the input feature as opposed to the specific encoder you want to use.

We had a discussion about whether to flatten or nest for input features a while back. Here's @w4nderlust's comment:

Not nested encoder/decoder parameters: there’s merit to both indentation and not indentation. At the moment I’d like to keep it not indented. The main motivation being that there are some parameters in the config passed to the encoders that are added depending on data preprocessing (for instance the vocab size of a categorical embed encoder) and those may be needed by both encoders and decoders and potentially other functions within the input/output feature object, and after trying for a few minutes to implement this there doesn’t seem to be a simple way not to break encapsulation without a substantial rework. None of this would be unsolvable, but it would require more reworking than it seems at first glance, so I don’t think this is the right time for this additional reworking; we should reconsider later on if we want to revisit this. From a PQL perspective, I think it’s fine to have encoder and decoder parameters on their own and then flatten them before passing to Ludwig. I believe the nesting is a better idea in general, don’t get me wrong, but I would want to explore thoroughly the nuances and corner cases in the code and its impact on the config verbosity more in depth.

I think the idea is that the feature module is the sole owner of its encoder submodule. The fact that the same parameters are shared between a feature module that instantiates a submodule with those parameters is an implementation detail.

As Piero mentioned, there could be some additional (internal-only) parameters that are decided at preprocessing and then passed to the encoder/decoder submodule, and this is managed at the feature level -- but this is also an implementation detail.

By definition, then, there aren't any parameters (except for preprocessing which has its own subsection) that would be used exclusively at the feature level and not at the encoder submodule level.

  • The code required to validate a schema without a nested module structure becomes very messy and convoluted which will only become more of a problem over time with the addition of new encoders and our value prop of extensibility.

@ksbrar How difficult is it to validate a schema without the nested parameter structure? I imagine that we only need what we're already doing for the `combiner`, `trainer`, and `optimizer` sections, which is to use the special `type` parameter to dictate how to validate the rest of the declared parameters within the section, i.e.

combiner:
  type: concat
  <list of parameters that should be validated against the `ConcatCombiner` combiner schema>

optimizer:
  type: adamw
  <list of parameters that should be validated against the `adamw` optimizer schema>

trainer:
  type: GBM
  <list of parameters that should be validated against the `GBM` trainer schema>
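The type-dispatch idea above can be sketched in plain Python, independent of marshmallow specifics (the schema contents here are illustrative, not Ludwig's actual parameter sets):

```python
# Illustrative type-dispatched validation: the `type` field selects which
# set of parameters is legal for the rest of the section.
SCHEMAS = {
    "concat": {"fc_layers", "num_fc_layers", "output_size"},  # assumed params
    "adamw": {"lr", "betas", "weight_decay"},                 # assumed params
}

def validate_section(section: dict) -> list:
    """Return the parameters not allowed for the declared type."""
    allowed = SCHEMAS[section["type"]]
    return [k for k in section if k != "type" and k not in allowed]

combiner = {"type": "concat", "output_size": 128, "strides": 5}
print(validate_section(combiner))  # ['strides'] -- not a concat param
```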

Possibilities that would convince me that nesting is absolutely necessary:

  • There actually are parameters (other than preprocessing/internal ones) that would be used at the feature level and not at the encoder submodule level.
  • Validation is significantly less difficult if we do parameter nesting.
  • The nested structure provides strong external value beyond what's reasonable to manage internally.

Open to more thoughts on this. @dantreiman @w4nderlust @tgaddair @brightsparc

justinxzhao avatar Jun 10 '22 19:06 justinxzhao

@ksbrar chime in here if I'm missing something, but my understanding is that schema validation without the nested parameter structure would just require a lot of marshmallow `validates_schema()` logic. Since we would essentially house all the possible encoder parameters from every encoder available to a given input feature type in that same input feature schema, we would need a complex validation function that makes sure you are only specifying the encoder params that belong to the particular encoder type you declared. While this is entirely possible, I foresee it getting pretty complicated in the future if we add additional encoders to certain feature types.
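For concreteness, flat-layout validation ends up looking something like this toy sketch (encoder names and parameter sets are illustrative, not Ludwig's actual registry):

```python
# Toy sketch of flat-layout validation: every encoder's params live at the
# feature level, so validation must cross-check each param against the
# params of the *declared* encoder, not just the union of all encoders.
TEXT_ENCODER_PARAMS = {  # illustrative parameter sets
    "stacked_cnn": {"dropout", "strides", "output_size", "reduce_output"},
    "rnn": {"dropout", "state_size", "bidirectional", "reduce_output"},
}

def validate_flat_feature(feature: dict) -> list:
    """Flag params that belong to encoders other than the declared one."""
    declared = feature["encoder"]
    allowed = TEXT_ENCODER_PARAMS[declared] | {"name", "type", "encoder"}
    all_known = set().union(*TEXT_ENCODER_PARAMS.values())
    # A param is an error if some encoder defines it but the declared one doesn't.
    return [k for k in feature if k in all_known and k not in allowed]

feature = {"name": "f", "type": "text", "encoder": "stacked_cnn",
           "dropout": 0.2, "state_size": 256}  # state_size belongs to rnn
print(validate_flat_feature(feature))  # ['state_size']
```

Every new encoder added to a feature type grows `all_known` and the cross-checking burden, which is the maintenance concern described above.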

The bigger reason I personally see to upgrade is how the schema object we're building out maps onto the SDK. Since we want a config object that the user interacts with to build their models, the way our schema is structured is essentially what the customer will be touching when using the SDK. So we want it not only to be consistent across modules, but also intuitive to use. If we choose not to refactor the schema and nest the encoder/decoder structure, customers may be confused when specifying encoders as part of their input feature, not knowing which items to include and which not to. I also know that extensibility for encoders is something we support, and while this is an area I'm not very familiar with, I feel that a nested structure would lend itself well to incorporating user-defined encoders. For example:

input_features:
  - name: feature_1
    type: categorical
    encoder:
      type: user_defined_encoder
      param_1: ...
      param_2: ...

Something like this makes a lot of sense from a user perspective IMO. Additionally, I imagine down the road we could even create adaptable logic that validates the schema for a user-defined encoder.
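That adaptable logic could look something like a schema registry keyed by encoder type (a hedged sketch; none of these names are an existing Ludwig API):

```python
# Hypothetical encoder-schema registry: user-defined encoders register the
# parameters they accept, and nested configs are validated by `type` lookup.
ENCODER_REGISTRY = {"dense": {"dropout", "activation"}}

def register_encoder(type_name: str, params: set) -> None:
    ENCODER_REGISTRY[type_name] = params

def validate_encoder(encoder: dict) -> list:
    """Return unknown parameters for the declared encoder type."""
    allowed = ENCODER_REGISTRY[encoder["type"]]
    return [k for k in encoder if k != "type" and k not in allowed]

# A user plugs in their own encoder alongside the built-ins:
register_encoder("user_defined_encoder", {"param_1", "param_2"})
print(validate_encoder({"type": "user_defined_encoder", "param_1": 1}))  # []
```

Because the nested `encoder` block is self-contained, the registry lookup is the only dispatch needed; the feature schema never has to know the custom encoder's parameters.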

Anyways, I greatly appreciate your feedback and would love to hear others' thoughts as well so we can decide whether or not to go through with this :)

connor-mccorm avatar Jun 10 '22 19:06 connor-mccorm

Different proposals:

# 1. This
LudwigConfig(input_features=[
   BinaryInputFeature(name="foo", encoder=Dense(num_layers=3), preprocessing=...)
])

# 2. No config changes option A
LudwigConfig(input_features=[
   BinaryInputDenseEncoder(name="foo", num_layers=3, preprocessing=...)
])

# 3. No config changes option B
LudwigConfig(input_features=[
    BinaryInputFeature(name="foo", encoder="dense", num_layers=3)
])

tgaddair avatar Jun 10 '22 21:06 tgaddair

Current thoughts:

  • Generally like the nested structure
  • Option 2 may be the best compromise for now
  • Worth investigating how hard option 1 would be, as it is the long-term solution

tgaddair avatar Jun 10 '22 21:06 tgaddair

It may also be worth thinking about how this overlays with the current PQL, which is in line with option 1. The UX also more closely matches this IMHO. My understanding of the original pushback when I raised this idea was the challenge of changing the Ludwig codebase to support it -- could we perhaps confirm this again?

brightsparc avatar Jun 10 '22 21:06 brightsparc

@brightsparc we discussed the challenge of implementing the nested structure in terms of the refactoring effort. The consensus is that we should explore it and see where the complexity lies, if any, and how much effort would be required to resolve it.

tgaddair avatar Jun 12 '22 15:06 tgaddair

I plan on scoping out the difficulty of a refactor after working through option 2 to fill out the schema in the meantime.

connor-mccorm avatar Jun 14 '22 21:06 connor-mccorm