[Example Request] SM Pipeline with built-in LightGBM, AutoGluon, CatBoost, TabTransformer algorithm

athewsey opened this issue 1 year ago · 2 comments

Describe the use case example you want to see

A SageMaker Pipeline to train, evaluate, and register a model using one (or more?) of the new JumpStart-based built-in algorithms for tabular data, preferably via the SageMaker Python SDK with PipelineSession.
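
To make the ask concrete, a minimal sketch of the pipeline shape I have in mind might look like the below (the role ARN, bucket path, instance type and entry-point script name are illustrative assumptions on my part, and the three URI lookups are discussed in the next section):

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TrainingStep

pipeline_session = PipelineSession()

# Illustrative placeholders (not from any existing sample):
role = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"
train_data_s3_uri = "s3://my-bucket/lightgbm/train/"

# For the JumpStart-based built-ins, these three URIs need to be looked up
# explicitly via sagemaker.image_uris / script_uris / model_uris (see the
# retrieval sketch in the next section):
train_image_uri = "<training-image-uri>"
train_script_uri = "<training-script-bundle-s3-uri>"
train_model_uri = "<pre-trained-model-artifact-s3-uri>"

estimator = Estimator(
    image_uri=train_image_uri,
    source_dir=train_script_uri,         # training script bundle
    model_uri=train_model_uri,           # "pre-trained" artifact channel
    entry_point="transfer_learning.py",  # assumed script name inside the bundle
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

step_train = TrainingStep(
    name="TrainTabularBuiltin",
    step_args=estimator.fit({"training": TrainingInput(s3_data=train_data_s3_uri)}),
)

pipeline = Pipeline(
    name="TabularBuiltinPipeline",
    steps=[step_train],
    sagemaker_session=pipeline_session,
)
# pipeline.upsert(role_arn=role); pipeline.start()
```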

How would this example be used? Please describe.

The new JumpStart-based tabular built-in algorithms (AutoGluon-Tabular, CatBoost, LightGBM, TabTransformer) have some extra usage complexities beyond XGBoost (see the retrieval sketch after this list):

  • Separate container image URIs must be used for training vs. inference; otherwise errors will generally be thrown due to missing libraries, executables, etc.
  • Script bundles must be looked up (e.g. via sagemaker.script_uris.retrieve()) and provided to both the training and inference stages, and the models created by these training jobs appear to require re-packing to properly insert the inference scripts.
  • "Pre-trained" model artifacts seem to be mandatory for the training job (retrieved via e.g. sagemaker.model_uris.retrieve()).
  • The data channel structure is different, using a single training channel with specifically named subfolders and files instead of separate train, validation, etc. channels.

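As a rough sketch of what those lookups involve (taking LightGBM classification as an assumed example; the other three algorithms follow the same pattern):

```python
from sagemaker import hyperparameters, image_uris, model_uris, script_uris

model_id, model_version = "lightgbm-classification-model", "*"  # assumed example

# Training-scope artifacts: container image, script bundle, and the
# "pre-trained" model tarball that the training job starts from.
train_image_uri = image_uris.retrieve(
    region=None, framework=None, model_id=model_id, model_version=model_version,
    image_scope="training", instance_type="ml.m5.xlarge",
)
train_script_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training",
)
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training",
)
# Default hyperparameters can also be looked up per model ID:
default_hps = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version,
)

# Inference uses a *different* image and script bundle from training:
inference_image_uri = image_uris.retrieve(
    region=None, framework=None, model_id=model_id, model_version=model_version,
    image_scope="inference", instance_type="ml.m5.xlarge",
)
inference_script_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference",
)
```
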
We have sample notebooks available for these algorithms, usually listed on the algorithm doc pages themselves (e.g. here for AutoGluon)... but as far as I've found, the only samples for SM Pipelines tend to be XGBoost-based or to use custom models.

The extra complexity (around image, script and model artifact URIs in particular) can make it a challenge for customers who aren't yet familiar with script mode (and are only trying out and comparing built-in algorithms) to get started with these more advanced tabular algorithms: it's not straightforward today to take an XGBoost sample and just plug in a different algorithm name.
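
For example, on the model registration side the usual XGBoost pattern of registering the training output directly with the same image doesn't seem to carry over; something along these lines (a hedged sketch continuing the earlier snippets, with the entry-point name and model package group name assumed) appears to be needed so the SDK re-packs the artifact with the inference script and uses the inference image:

```python
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep

# Continues the earlier sketches: assumes pipeline_session, role, step_train,
# inference_image_uri and inference_script_uri are already defined.
model = Model(
    image_uri=inference_image_uri,    # inference image, not the training one
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    source_dir=inference_script_uri,  # triggers re-packing with the script bundle
    entry_point="inference.py",       # assumed script name inside the bundle
    role=role,
    sagemaker_session=pipeline_session,
)

step_register = ModelStep(
    name="RegisterTabularBuiltin",
    step_args=model.register(
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.m5.xlarge"],
        transform_instances=["ml.m5.xlarge"],
        model_package_group_name="tabular-builtin-models",  # assumed group name
    ),
)
```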

So I'd suggest it would be helpful to either extend an existing sample or add a new one, showing how pipelining translates from XGBoost to these other tabular algorithms.

Describe which SageMaker services are involved

  • Pipelines
  • Built-in algorithms (JumpStart-based)

Describe what other services (other than SageMaker) are involved

  • None?

Describe which dataset could be used. Provide its location in s3://sagemaker-sample-files or another source.

athewsey · Dec 05 '22 08:12