VALL-E-X

Question about text tokenizer

Open BakerBunker opened this issue 1 year ago • 4 comments

I have noticed that there are two tokenizer dicts in utils/g2p, bpe_1024 and bpe_69. Which one is more suitable for the generation task in your actual training practice? Thank you.

BakerBunker avatar Sep 08 '23 08:09 BakerBunker

bpe_1024.json is never used in training. It was experimental and makes no difference in this project.

Plachtaa avatar Sep 12 '23 08:09 Plachtaa

Hi, can you give us some advice on how to make a new bpe_x.json from our own data?

yangyyt avatar Jul 28 '24 03:07 yangyyt

Oh, after reading the code I think I see how to prepare a new BPE model: convert the training data into IPA format, then train a BPE tokenizer on it to generate the bpe.json. Is that right? @Plachtaa

yangyyt avatar Jul 28 '24 05:07 yangyyt

You are right.

Plachtaa avatar Jul 29 '24 15:07 Plachtaa
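
To make the confirmed recipe concrete, here is a minimal sketch of the two steps. It assumes the phonemizer package for the IPA conversion (VALL-E-X ships its own G2P utilities in utils/g2p, so this is a stand-in) and the HuggingFace tokenizers library, whose JSON format bpe_69.json appears to follow. File names and the vocab_size are placeholders, not values from this project.

```python
# Hypothetical helper script; transcripts.txt, ipa_corpus.txt, bpe_custom.json
# and vocab_size=69 are placeholders.
from phonemizer import phonemize  # stand-in for the project's utils/g2p conversion
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Step 1: convert raw transcripts to IPA, one utterance per line.
with open("transcripts.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]
ipa_lines = phonemize(lines, language="en-us", backend="espeak", strip=True)
with open("ipa_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(ipa_lines))

# Step 2: train a BPE tokenizer on the IPA corpus and save it as JSON.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=69, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["ipa_corpus.txt"], trainer=trainer)
tokenizer.save("bpe_custom.json")
```

Dropping the resulting JSON into utils/g2p and pointing the loader at it should then be possible, though the vocabulary size has to stay consistent with what the model's text embedding expects.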