amazon-sagemaker-examples
[Bug Report] `text/csv; charset=utf-8` is not supported in SageMaker Pipeline with sklearn and xgboost models
Hi, I have created a SageMaker PipelineModel using an SKLearn model followed by an XGBoost model. I followed the instructions here to set the `SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT` environment variable, but I'm getting a
`ValueError: Content type text/csv; charset=utf-8 is not supported.` error when the batch transform job reaches the 2nd (xgboost) container, which follows the sklearn container.
My pipeline code looks as follows:
```python
feature_model = SKLearnModel(
    model_data=feature_model_s3_path,
    sagemaker_session=sagemaker_session,
    role=role,
    framework_version="0.23-1",
    entry_point=os.path.join(BASE_DIR, "scripts", "sagemaker_feature_transform.py"),
)
feature_model.env = {"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv"}

model = XGBoostModel(
    framework_version="1.0-1",
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sagemaker_session,
    entry_point=os.path.join(BASE_DIR, "scripts", "sagemaker_xgb_training.py"),
    role=role,
)

pipeline_model = PipelineModel(
    name="pipeline-model",
    role=role,
    models=[feature_model, model],
    sagemaker_session=sagemaker_session,
)
```
As the inference output code of container 1 (sklearn), I am using:
```python
from sagemaker_containers.beta.framework import encoders, worker


def output_fn(prediction, accept):
    if accept == "text/csv":
        return worker.Response(encoders.encode(prediction, accept), mimetype=accept)
    else:
        # Note: Python has no built-in RuntimeException; RuntimeError is the correct exception.
        raise RuntimeError("{} accept type is not supported by this script.".format(accept))
```
As the inference input code of container 2 (xgb), I am using:
```python
# Import as used in the SageMaker XGBoost script-mode examples.
from sagemaker_xgboost_container import encoder as xgb_encoders


def input_fn(request_body, request_content_type):
    if request_content_type == "text/libsvm":
        return xgb_encoders.libsvm_to_dmatrix(request_body)
    elif request_content_type == "text/csv":
        return xgb_encoders.csv_to_dmatrix(request_body)
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))
```
It seems that even though I am forcing the output content type of container 1 to `text/csv`, what arrives at container 2 is the unknown `text/csv; charset=utf-8` content type. Any ideas what I am doing wrong?
Thank you for your help!
I solved the problem by adding the following code to the xgboost inference script `sagemaker_xgb_training.py`:
```python
def rchop(s, suffix):
    if suffix and s.endswith(suffix):
        return s[: -len(suffix)]
    return s


def input_fn(request_body, request_content_type):
    if request_content_type == "text/csv; charset=utf-8":
        request_body = request_body.decode("utf-8")
        request_body = rchop(request_body, "\n")
        return xgb_encoders.csv_to_dmatrix(request_body)
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))
```
This adds support for the missing `text/csv; charset=utf-8` content type, decodes the request body, and strips the trailing `\n` before calling the xgb_encoder.
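A slightly more general variant (an untested sketch, reusing the `xgb_encoders` import from the earlier snippet) would split off MIME parameters such as `charset` instead of matching the exact string, so both `text/csv` and `text/csv; charset=utf-8` are accepted:

```python
from sagemaker_xgboost_container import encoder as xgb_encoders


def input_fn(request_body, request_content_type):
    # Strip MIME parameters such as "; charset=utf-8" before matching,
    # so "text/csv" and "text/csv; charset=utf-8" are treated alike.
    content_type = request_content_type.split(";")[0].strip().lower()
    if content_type == "text/csv":
        # The body may arrive as bytes depending on the serving stack.
        if isinstance(request_body, bytes):
            request_body = request_body.decode("utf-8")
        return xgb_encoders.csv_to_dmatrix(request_body.rstrip("\n"))
    raise ValueError("Content type {} is not supported.".format(request_content_type))
```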
I experience the same issue, but with an sklearn preprocessor -> LightGBM pipeline.
@Kyparos Did you resolve your issue? I am also facing the same problem with an sklearn preprocessor -> SageMaker LightGBM pipeline.
I believe the problem is that Pipeline Models honor the input and output content types of the overall pipeline, but not the content types passed between containers.
That is, when you start a Batch Transform, you can set the input content type to `text/csv` and the output (accept) content type to `text/csv`. However, this does not set the output content type of the first container to `text/csv` (you can verify this in the logs), so it falls back to `application/json`, which makes the 2nd container fail.
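For reference, a minimal sketch of starting such a Batch Transform with the SageMaker Python SDK (the S3 paths and instance type are hypothetical). Note that `content_type` only governs what the first container receives and `accept` only what the last container returns; nothing here controls the hand-off in between:

```python
# Hypothetical S3 locations; substitute your own.
transformer = pipeline_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/transform-output/",
    accept="text/csv",  # content type of the final output
)
transformer.transform(
    "s3://my-bucket/transform-input/data.csv",
    content_type="text/csv",  # content type of the input to the first container
    split_type="Line",
)
transformer.wait()
```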