PhiCookBook
Issue running the audio example
Hi, when I tried to run the audio transcription example at https://github.com/microsoft/PhiCookBook/blob/main/md/02.Application/05.Audio/Phi4/Transciption/README.md
I hit the following error:
  File "/mnt/nvme1/huggingface_cache/modules/transformers_modules/microsoft_Phi-4-multimodal-instruct/modeling_phi4mm.py", line 2137, in forward
    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
                                           ^^^^^^^^^^^^^^^^^^^
TypeError: bad operand type for unary -: 'NoneType'
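The failing line negates `num_logits_to_keep` before slicing, so a value of `None` fails before any indexing happens. Here is a minimal reproduction that runs without torch. It assumes a plain Python list as a stand-in for the `hidden_states` tensor, and the helper name `keep_last` is hypothetical, not part of the model code:

```python
def keep_last(hidden_states, num_logits_to_keep):
    # Mirrors the slice in the traceback, hidden_states[:, -num_logits_to_keep:, :],
    # reduced to one dimension so it runs on a plain list.
    return hidden_states[-num_logits_to_keep:]


positions = ["h0", "h1", "h2", "h3"]  # stand-in for per-token hidden states

# An integer works: only the final position is kept.
last = keep_last(positions, 1)

# None cannot be negated, so the slice expression raises TypeError.
try:
    keep_last(positions, None)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(last)
print(error_message)
```

So the error points at the value reaching `forward`, not at the audio input or the prompt.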
Hi, can you try this?
import requests
import torch
import soundfile
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
model_path = 'Your Phi-4-multimodal location'
# Load processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Load model with consistent dtype settings
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Be explicit about dtype
    use_flash_attention_2=True,  # Use this instead of _attn_implementation
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path)
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
# Load audio - make sure to get both the audio data and sample rate
audio_data, sample_rate = soundfile.read('./ignite.wav')
audio = (audio_data, sample_rate) # Pass as tuple with sample rate
# Process inputs
inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')
# Generate with output_scores=True to ensure proper tracking of logits
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
    output_scores=True,  # Add this to ensure proper logit handling
)
# Extract only the new tokens from the output
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
I ran into this problem too. I followed your code but still got TypeError: bad operand type for unary -: 'NoneType'
from PIL import Image
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
audio = soundfile.read(dataset["Audio"][0])
inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
)
The final model.generate call raises this error.
Following https://huggingface.co/microsoft/Phi-4-multimodal-instruct/discussions/46,
it seems you should pass the num_logits_to_keep argument to the generate method:
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
    num_logits_to_keep=1,
)
or modify modeling_phi4mm.py so that forward falls back to the full sequence length when the argument is None:
num_logits_to_keep = hidden_states.size(1) if num_logits_to_keep is None else num_logits_to_keep
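The effect of that one-line fallback can be sketched without loading the model. This is only an illustration under stated assumptions: a nested list stands in for one batch element of `hidden_states`, and `select_positions` is a hypothetical helper, not a function in modeling_phi4mm.py.

```python
def select_positions(hidden_states, num_logits_to_keep=None):
    # Same fallback as the patch: if the caller passes nothing,
    # keep every sequence position instead of negating None.
    if num_logits_to_keep is None:
        num_logits_to_keep = len(hidden_states)
    return hidden_states[-num_logits_to_keep:]


seq = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 positions, hidden size 2

all_positions = select_positions(seq)      # fallback path, nothing passed
last_position = select_positions(seq, 1)   # what num_logits_to_keep=1 requests

print(len(all_positions), len(last_position))
```

Both workarounds avoid the TypeError. Passing `num_logits_to_keep=1` restricts the logits to the final position, which is the one used to pick the next token, while the fallback keeps every position.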