
In generate.py, prompt_length > 4096 (generate_long) leads to max_new_tokens < 0.

wsd12345 opened this issue 1 year ago · 4 comments

Self Checks

  • [X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

I have texts, e.g. ["辉瑞制药又一次敏锐地捕捉到了时代的脉搏。", "1951年,辉瑞制药再次取得了一项重大的科研突破。"]. When the llama model fails to predict the stop token 4 for texts[0], then at [generate.py 539](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#:~:text=538-,539,-540), decoded is a vector with nearly 4096 entries, so max_new_tokens = T_new - T < 0 and generation errors out.
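For reference, a minimal, hypothetical sketch of the arithmetic described above (the names T, T_new, max_new_tokens mirror those in generate.py; the numbers and the guard at the end are illustrative assumptions, not the project's code):

max_seq_length = 4096          # model context window
T = 4100                       # accumulated prompt: text tokens plus ~4096 codes carried
                               # over because the previous segment never emitted stop token 4
T_new = min(T + 1024, max_seq_length)   # continuation length capped at the window
max_new_tokens = T_new - T              # 4096 - 4100 = -4, so generation errors out

# One possible defensive guard (hypothetical):
if max_new_tokens <= 0:
    print("No room left in the context window; the previous segment never produced the stop token.")
    max_new_tokens = 0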

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

wsd12345 · Sep 24 '24 03:09

Your reference audio length should not exceed 90 seconds.

AnyaCoder · Sep 24 '24 06:09

Your reference audio length should not exceed 90 seconds.

Thanks for your response. I use 23 s of reference audio. In decode_n_tokens, I find that the stop token is never predicted. The dimensions of the token are as follows: 1.

I forgot to mention above: this problem was found without using multinomial sampling (argmax was used instead),

import torch

def multinomial_sample_one_no_sync(
    probs_sort,
):  # Does multinomial sampling without a CUDA synchronization
    # Original sampling lines, commented out for this experiment:
    # q = torch.empty_like(probs_sort).exponential_(1)
    # return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
    # Replacement used here: plain argmax (greedy decoding)
    return torch.argmax(probs_sort, dim=-1, keepdim=True).to(dtype=torch.int)
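For comparison, the two commented-out lines are the original multinomial sampling, which uses the exponential (Gumbel-max style) trick: dividing the probabilities by i.i.d. Exp(1) noise and taking the argmax draws a sample from the categorical distribution without a CUDA synchronization. A self-contained sketch of both paths (plain PyTorch only; the function names here are illustrative, not fish-speech's API):

import torch

def sample_multinomial(probs: torch.Tensor) -> torch.Tensor:
    # Exponential-race trick: argmax(p_i / q_i) with q_i ~ Exp(1)
    # is distributed exactly like a multinomial draw from p.
    q = torch.empty_like(probs).exponential_(1)
    return torch.argmax(probs / q, dim=-1, keepdim=True).to(dtype=torch.int)

def sample_greedy(probs: torch.Tensor) -> torch.Tensor:
    # The modification above: always pick the most likely token.
    # Deterministic, so a loop in the model's predictions repeats indefinitely
    # and the stop token may never be chosen.
    return torch.argmax(probs, dim=-1, keepdim=True).to(dtype=torch.int)

probs = torch.softmax(torch.randn(1, 8), dim=-1)
print(sample_multinomial(probs), sample_greedy(probs))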

The encoded features of my reference audio: ref.zip

wsd12345 · Sep 24 '24 06:09

You 100% need multinomial sampling; argmax will cause repetition patterns.

leng-yue · Sep 24 '24 09:09

You 100% need multinomial sampling; argmax will cause repetition patterns.

The generated audio waveform is indeed repetitive noise toward the end. But I don't know why it keeps repeating; would increasing the training data improve the repetition problem?

wsd12345 · Sep 24 '24 09:09

Using greedy decoding with any LLM can run into the same issue.

leng-yue · Sep 25 '24 03:09

Using greedy decoding with any LLM can run into the same issue.

Thanks.

wsd12345 · Sep 25 '24 05:09