DJL running without speculative decoding
Hello. I am using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 container to run inference for Llama3.3-70B-Instruct. The container is launched using Docker. I have created a repo dir with 2 models: a 70B model and an 8B model (model ids: mymodel and mymodeldraft). Here are the serving.properties of both models:
For 70B model (mymodel):
engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.trust_remote_code=true
option.rolling_batch=lmi-dist
option.max_input_len=32768
option.max_output_len=32768
option.max_model_len=32768
option.gpu_memory_utilization=0.5
option.max_rolling_batch_size=32
option.enable_prefix_caching=true
option.enable_streaming=false
option.speculative_draft_model=mymodeldraft
option.draft_model_tp_size=8
option.speculative_length=5
For 8B model (mymodeldraft):
engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.trust_remote_code=true
option.rolling_batch=lmi-dist
option.max_input_len=32768
option.max_output_len=32768
option.max_model_len=32768
option.gpu_memory_utilization=0.4
option.max_rolling_batch_size=32
option.enable_prefix_caching=true
option.enable_streaming=false
When launching the container, DJL starts and both models are loaded, but in the log I see this message:
INFO PyProcess W-749-mymodel-stdout: [1,0]<stdout>:WARNING 01-24 08:07:00 arg_utils.py:66] Speculative decoding feature is only available on SageMaker. Running without speculative decoding...
When running inference, it seems that speculative decoding is not active and the draft model is not being called.
We are running the DJL container on SageMaker endpoints. Can you please explain how we can make this feature work? Thank you.
I believe the issue is due to how you are specifying the models. With speculative decoding, you should not define the speculative model as a separate model; you must provide only a single serving.properties file. When you provide two, DJL-Serving treats them as two separate models.
This is how I would recommend structuring your model directory in s3:
<base model artifacts>
serving.properties
speculative_model/
    <speculative model artifacts>
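With that layout, the single serving.properties at the root carries both the base-model options and the speculative-decoding options. A minimal sketch, reusing the option names from your original config and assuming option.speculative_draft_model can point at the local speculative_model/ subdirectory (adjust TP degrees and lengths for your hardware):

engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.rolling_batch=lmi-dist
option.speculative_draft_model=speculative_model
option.draft_model_tp_size=8
option.speculative_length=5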
I validated this with the same image you mentioned, 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124.
In s3, I had the structure mentioned above:
serving.properties
*.safetensors
config.json
tokenizer.json
tokenizer_config.json
llama-3-1b/
    *.safetensors
    config.json
    tokenizer.json
    tokenizer_config.json
from sagemaker.model import Model

# The uncompressed S3 prefix containing the layout above
model_data = {
    "S3DataSource": {
        "S3Uri": "s3://my-bucket/llama-3-spec-dec/",
        "S3DataType": "S3Prefix",
        "CompressionType": "None",
    }
}

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
)
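Deployment is then the usual SageMaker flow; a quick sketch, where the instance type and endpoint name are assumptions (pick whatever instance fits your tensor parallel degree):

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",  # assumption: an 8-GPU instance for TP=8
    endpoint_name="llama-3-spec-dec",  # hypothetical endpoint name
)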
When deployed on a SageMaker endpoint, I can confirm speculative decoding is enabled via the following log:
[INFO ] PyProcess - W-161-model-stdout: [1,0]<stdout>:INFO 02-13 20:36:56 arg_utils.py:67] Found draft_model parameter: llama-3-1b, will apply speculative patch
@siddvenk Thank you, will try that.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.