PhiCookBook Issue running the audio example

Hi, when I tried to run the audio transcription example at https://github.com/microsoft/PhiCookBook/blob/main/md/02.Application/05.Audio/Phi4/Transciption/README.md

I got the following error:

  File "/mnt/nvme1/huggingface_cache/modules/transformers_modules/microsoft_Phi-4-multimodal-instruct/modeling_phi4mm.py", line 2137, in forward
    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
                                           ^^^^^^^^^^^^^^^^^^^
TypeError: bad operand type for unary -: 'NoneType'

Mar 16 '25 00:03 noobHappylife

Hi, can you try this?

import torch
import soundfile
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = 'Your Phi-4-multimodal location'

# Load processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load model with consistent dtype settings
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Be explicit about dtype
    use_flash_attention_2=True,  # Use this instead of _attn_implementation
).cuda()

generation_config = GenerationConfig.from_pretrained(model_path)

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'

# Load audio - make sure to get both the audio data and sample rate
audio_data, sample_rate = soundfile.read('./ignite.wav')
audio = (audio_data, sample_rate)  # Pass as tuple with sample rate

# Process inputs
inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')

# Generate with output_scores=True to ensure proper tracking of logits
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
    output_scores=True,  # Add this to ensure proper logit handling
)

# Extract only the new tokens from the output
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)

Mar 25 '25 14:03 leestott

I ran into the same problem. I followed your code but still got TypeError: bad operand type for unary -: 'NoneType'

from PIL import Image
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
audio = soundfile.read(dataset["Audio"][0])
inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
)

The last statement (the model.generate call) raises this error.

Apr 27 '25 22:04 bartlomiejmarek

Following https://huggingface.co/microsoft/Phi-4-multimodal-instruct/discussions/46, it seems you should pass the num_logits_to_keep argument to the generate method:

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
    num_logits_to_keep=1
)

or modify modeling_phi4mm.py so that a missing value falls back to the full sequence length:

num_logits_to_keep = hidden_states.size(1) if num_logits_to_keep is None else num_logits_to_keep
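
To illustrate why the fallback helps, here is a minimal pure-Python sketch. It is not the model's code: the helper names and the nested-list stand-in for hidden_states are hypothetical, and it only mimics the slice hidden_states[:, -num_logits_to_keep:, :] from the traceback. Negating None is what raises the TypeError, so defaulting to the sequence length avoids it and keeps every position.

```python
def keep_last_positions_unpatched(hidden_states, num_logits_to_keep=None):
    # Mirrors the failing slice: negating None raises TypeError
    # before any slicing happens.
    return [sequence[-num_logits_to_keep:] for sequence in hidden_states]


def keep_last_positions(hidden_states, num_logits_to_keep=None):
    # Fallback suggested in this thread: when the caller passes nothing,
    # keep all positions (the sequence length along dimension 1).
    if num_logits_to_keep is None:
        num_logits_to_keep = len(hidden_states[0])
    return [sequence[-num_logits_to_keep:] for sequence in hidden_states]


# Hypothetical stand-in for a (batch=2, seq_len=3) tensor of hidden states.
batch = [["h0", "h1", "h2"], ["g0", "g1", "g2"]]

try:
    keep_last_positions_unpatched(batch)
except TypeError as error:
    print("unpatched:", error)

print("patched, default:", keep_last_positions(batch))
print("patched, keep 1:", keep_last_positions(batch, 1))
```

Passing num_logits_to_keep=1 keeps only the last position, which is usually all that is needed to pick the next token during generation; the fallback keeps everything and is therefore the more conservative choice.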

Apr 27 '25 22:04 bartlomiejmarek