ludwig icon indicating copy to clipboard operation
ludwig copied to clipboard

Encoder refactor

Open connor-mccorm opened this issue 2 years ago • 7 comments

PR to refactor Encoder/Decoder section of input/output features so that it is nested one level deeper. For example, an input feature that originally looked like this:

input_features:
      - name: image_feature_1
         type: image
         preprocessing:
              height: 7.5
              width: 7.5
              num_channels": 4
         encoder: resnet
         num_channels: 4
         dropout: 0.1
         resnet_size: 100

should now be defined like this:

input_features:
      - name: image_feature_1
         type: image
         preprocessing:
              height: 7.5
              width: 7.5
              num_channels": 4
         encoder: 
              type: resnet
              num_channels: 4
              dropout: 0.1
              resnet_size: 100

I have added some backwards compatibility logic and a test for it as well. Also since it is such a large PR, if you are an expert in a specific area of the codebase and don't have time to do the whole thing, feel free to just inject your expertise where it works best for you!

connor-mccorm avatar Jul 13 '22 01:07 connor-mccorm

Unit Test Results

       3 files   -     3         3 suites   - 3   55m 26s :stopwatch: - 1h 38m 39s 2 838 tests  - 124  2 683 :heavy_check_mark:  - 230    56 :zzz: +  7    99 :x: +  99  8 514 runs   - 372  8 014 :heavy_check_mark:  - 689  204 :zzz: +21  296 :x: +296 

For more details on these failures, see this check.

Results for commit df823d01. ± Comparison against base commit f145e467.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 13 '22 01:07 github-actions[bot]

Are these nesting changes being flowed down to the hyperopt settings when applied at the feature level?

Eg if we were to add sparse and dense choice for hyper opt for category features would it be like this:

input_features:
    - name: genres
      type: set
    - name: content_rating
      type: category
      encoder:
        type: dense
    - name: top_critic
      type: binary
    - name: runtime
      type: number
    - name: review_content
      type: text
      encoder: embed
output_features:
    - name: recommended
      type: binary
hyperopt:
  goal: maximize
  output_feature: recommended
  metric: accuracy
  split: validation
  parameters:
    review_content.embedding_size:
      space: choice
      categories: [128, 256]
    review_content.encoder: 
      type: 
        space: choice
        categories:[sparse, dense]

Or like this:

hyperopt:
  goal: maximize
  output_feature: recommended
  metric: accuracy
  split: validation
  parameters:
    review_content.embedding_size:
      space: choice
      categories: [128, 256]
    review_content.encoder.type: 
       space: choice
       categories:[sparse, dense]

brightsparc avatar Jul 23 '22 01:07 brightsparc

It would be the latter actually. Was actually testing this out today specifically in test_hyperopt_with_shared_params.

connor-mccorm avatar Jul 23 '22 01:07 connor-mccorm

Apologies for the delay, found an issue with backwards compatibility so went back to fix that. Should be good to review now.

connor-mccorm avatar Jul 26 '22 20:07 connor-mccorm

Overall this is great - awesome work 👏 ! I've left a few minor comments on some potential refactoring but those are mostly nits.

arnavgarg1 avatar Aug 02 '22 21:08 arnavgarg1

Will we be replacing dictionary accesses with the schema class in a follow up PR?

jppgks avatar Aug 03 '22 14:08 jppgks

@jppgks Yes, I am currently writing the config object which joins all the schemas together into the internal replacement for the config in a separate PR.

connor-mccorm avatar Aug 03 '22 16:08 connor-mccorm

Can this be closed now that #2370 is merged? @connor-mccorm @dantreiman

jppgks avatar Aug 18 '22 10:08 jppgks

Indeed it can! @jppgks

connor-mccorm avatar Aug 18 '22 15:08 connor-mccorm