amazon-sagemaker-examples

[Example Request] SM Pipeline with built-in LightGBM, AutoGluon, CatBoost, TabTransformer algorithm

athewsey opened this issue 2 years ago • 2 comments

Describe the use case example you want to see

A SageMaker Pipeline to train, evaluate, and register a model using one (or more?) of the new JumpStart-based built-in algorithms for tabular data, preferably via the SageMaker Python SDK + PipelineSession.

How would this example be used? Please describe.

The new JumpStart-based tabular built-in algorithms (AutoGluon-Tabular, CatBoost, LightGBM, TabTransformer) have some extra usage complexities beyond XGBoost (roughly sketched in code after this list):

  • Separate container image URIs must be used for training vs inference; otherwise errors will generally be thrown due to missing libraries/executables/etc.
  • Script bundles must be looked up (via e.g. sagemaker.script_uris.retrieve()) and provided to both the training and inference stages; the models created by these training jobs also appear to require re-packing to properly insert inference scripts.
  • "Pre-trained" model artifacts seem to be mandatory (via e.g. sagemaker.model_uris.retrieve()) for the training job.
  • Data channel structure is different, using a single training channel with specifically named subfolders and files, instead of separate train, validation, etc. channels.
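
For illustration, here's roughly what those extra lookups look like for LightGBM (untested sketch; the instance type is just an example, and the same pattern applies to the other model IDs):

from sagemaker import image_uris, model_uris, script_uris

model_id, model_version = "lightgbm-classification-model", "*"

# Separate container images for training vs inference:
train_image = image_uris.retrieve(
    region=None, framework=None, model_id=model_id, model_version=model_version,
    image_scope="training", instance_type="ml.m5.xlarge",
)
inference_image = image_uris.retrieve(
    region=None, framework=None, model_id=model_id, model_version=model_version,
    image_scope="inference", instance_type="ml.m5.xlarge",
)

# Script bundles, again looked up per scope:
train_script = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
inference_script = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

# "Pre-trained" artifact the training job expects:
pretrained_model = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Data channel layout also differs: a single training channel whose S3 prefix
# holds specifically named subfolders/files (exact names vary by algorithm and
# version; check the algorithm's doc page).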

We have sample notebooks available for these algorithms, usually listed on the algorithm doc pages themselves (e.g. here for AutoGluon)... But as far as I've found, the only samples for SM Pipelines tend to be XGBoost-based or use custom models.

The extra complexity (around image, script, and model artifact URIs in particular) can make it a challenge for customers who aren't yet familiar with script mode (those only trying out and comparing built-in algorithms) to get started with these more advanced tabular algorithms: it's not straightforward today to take an XGBoost sample and just plug in a different algorithm name.

So I suggest it'd be helpful to either extend an existing sample or add a new one, to show how pipelining translates from XGBoost to the other tabular algorithms.

Describe which SageMaker services are involved

  • Pipelines
  • Built-in algorithms (JumpStart-based)

Describe what other services (other than SageMaker) are involved

  • None?

Describe which dataset could be used. Provide its location in s3://sagemaker-sample-files or another source.

athewsey • Dec 05 '22 08:12

+1

anand086 • Dec 16 '23 21:12

Since I was able to put together a simple pipeline with LightGBM using some docs, I'm sharing it here for anyone in need:

from sagemaker import Session, image_uris, script_uris, model_uris, hyperparameters
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

sess = Session()
aws_region = "us-east-1"
aws_role = "AWS_ROLE"  # replace with your SageMaker execution role ARN
train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Sample multiclass tabular dataset from the public JumpStart bucket
training_data_prefix = "training-datasets/tabular_multiclass/"
training_dataset_s3_path = f"s3://jumpstart-cache-prod-{aws_region}/{training_data_prefix}train"
validation_dataset_s3_path = f"s3://jumpstart-cache-prod-{aws_region}/{training_data_prefix}validation"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Retrieve the default hyperparameters for training the model
hyperparams = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparams["num_boost_round"] = "500"

# Retrieve the training container image (inference uses a separate image)
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)
# Retrieve the pre-trained model artifact required by the training job
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Create SageMaker Estimator instance
lgbm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1, # for distributed training, specify an instance_count greater than 1
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparams,
    output_path=s3_output_location
)

# Training step wiring the train/validation channels into the estimator
step_train = TrainingStep(
    name="LGBMTraining",
    estimator=lgbm_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=training_dataset_s3_path,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=validation_dataset_s3_path,
            content_type="text/csv"
        )
    }
)

# Assemble the pipeline, then create/update and start it
pipeline = Pipeline(
    name="TestLGBM",
    steps=[step_train],
    sagemaker_session=sess
)
pipeline.upsert(role_arn=aws_role)
start_response = pipeline.start()
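
To take this further toward what the issue asks (registering the model too), here's a rough, untested sketch of a register step. It assumes the pipeline above is rebuilt with a PipelineSession (as the issue suggests) rather than the plain Session, and "LightGBMPackageGroup" is just a placeholder model package group name. Note the separate inference-scope image and script bundle; attaching an entry_point to the Model is what triggers the re-packing mentioned in the issue:

from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

# Inference-scope image and script bundle (different from the training ones)
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope="inference",
    instance_type="ml.m5.xlarge",
)
deploy_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope="inference"
)

model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    entry_point="inference.py",  # entry point inside the JumpStart inference script bundle
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=aws_role,
    sagemaker_session=pipeline_session,
)

step_register = ModelStep(
    name="LGBMRegister",
    step_args=model.register(
        content_types=["text/csv"],
        response_types=["application/json"],
        inference_instances=["ml.m5.xlarge"],
        transform_instances=["ml.m5.xlarge"],
        model_package_group_name="LightGBMPackageGroup",  # placeholder name
    ),
)
# ...then include step_register in the Pipeline's steps list alongside step_train.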

Eduarcher • Jun 10 '24 18:06