Batch Transform Job fails with Internal Server Error when Data Capture is configured
**Describe the bug**
When configuring Data Capture for a Batch Transform job using the SageMaker Python SDK, job creation succeeds, but the execution fails with an "Internal Server Error". If Data Capture is not enabled, the job finishes successfully. This suggests a bug related to the Data Capture configuration in Batch Transform.
**To reproduce**
The setup is the same for both scenarios, with or without `BatchDataCaptureConfig`:
```python
from datetime import datetime

from sagemaker.transformer import Transformer
from sagemaker.inputs import BatchDataCaptureConfig

input_s3_data_location = "s3://bucket/prefix/batch-transform/input/input.json"
output_s3_data_location = "s3://bucket/prefix/batch-transform/output"
data_capture_destination = "s3://bucket/prefix/batch-transform/captured-data"
model_name = "my-previously-created-model"

transformer = Transformer(
    model_name=model_name,
    strategy="SingleRecord",
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=output_s3_data_location,
    max_concurrent_transforms=1,
    max_payload=6,
    tags=[{"Key": "some-key", "Value": "some-value"}],
)

timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
job_name = f"batch-transform-{timestamp}"
```
- Batch Transform job execution without `BatchDataCaptureConfig` - Success
```python
transform_arg = transformer.transform(
    job_name=job_name,
    data=input_s3_data_location,
    data_type="S3Prefix",
    content_type="application/json",
    split_type="Line",
    wait=True,
    logs=True,
)
```
- Batch Transform job execution with `BatchDataCaptureConfig` - Failure with an Internal Server Error
```python
transform_arg = transformer.transform(
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=data_capture_destination,
        generate_inference_id=True,
    ),
    job_name=job_name,
    data=input_s3_data_location,
    data_type="S3Prefix",
    content_type="application/json",
    split_type="Line",
    wait=True,
    logs=True,
)
```
Note: I've also tested with CSV files. The behavior is the same.
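For completeness, the CSV run differed only in the input object and the content type; a sketch of that call (the `.csv` path is a placeholder, not our exact file):

```python
# CSV variant of the failing call: same transformer and capture config,
# only the input object and content type change.
transform_arg = transformer.transform(
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=data_capture_destination,
        generate_inference_id=True,
    ),
    job_name=job_name,
    data="s3://bucket/prefix/batch-transform/input/input.csv",  # placeholder path
    data_type="S3Prefix",
    content_type="text/csv",
    split_type="Line",
    wait=True,
    logs=True,
)
```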
**Expected behavior**
Enabling Data Capture for Batch Transform should not cause the job to fail with an Internal Server Error. The job should complete successfully, and the captured data should be stored at the configured S3 destination.
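For reference, this is roughly how we check whether any captured records landed in S3 after a run (a minimal sketch using boto3; the bucket and prefix are the placeholders from the setup above):

```python
import boto3

# List objects under the capture destination from the setup above.
# "bucket" and the prefix are placeholders, not real names.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="bucket",
    Prefix="prefix/batch-transform/captured-data/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```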
**Screenshots or logs**
n/a
**System information**
- SageMaker Python SDK version: 2.244.1
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): n/a
- Framework version: n/a
- Python version: 3.12
- CPU or GPU: CPU (instance type ml.m5.large)
- Custom Docker image (Y/N): Y
**Additional context**
We have identified that this issue is likely related to inference ID generation: the Batch Transform job completes successfully when `BatchDataCaptureConfig` is provided with `generate_inference_id` set to `False`.
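For anyone hitting the same error, the call that succeeds for us is the failing call above with only `generate_inference_id` flipped:

```python
# Workaround: identical to the failing call, but with
# generate_inference_id disabled the job completes successfully.
transform_arg = transformer.transform(
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=data_capture_destination,
        generate_inference_id=False,
    ),
    job_name=job_name,
    data=input_s3_data_location,
    data_type="S3Prefix",
    content_type="application/json",
    split_type="Line",
    wait=True,
    logs=True,
)
```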