fast-bert

Vocab size while fine tuning language model

Open Sagar1094 opened this issue 4 years ago • 3 comments

Hi, I used around 8,000,000 text sentences while fine-tuning the language model, but the newly added vocabulary size is only 50,000. My data has at least 1,000,000-2,000,000 tokens to be added. Can I explicitly change the vocab size while fine-tuning? Thanks
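Before raising a vocabulary cap, it can help to measure how many distinct tokens the corpus really contains and how much of it a capped vocabulary already covers. The following is a minimal sketch in plain Python, not fast-bert code: `vocab_coverage` is a hypothetical helper, whitespace splitting stands in for real tokenization, and the sample lines are addresses of the kind discussed in this thread.

```python
from collections import Counter


def vocab_coverage(sentences, max_vocab):
    """Return (distinct token count, share of token occurrences
    covered by the max_vocab most frequent tokens)."""
    # Whitespace tokenization is an assumption made for illustration only.
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(max_vocab))
    return len(counts), (covered / total if total else 0.0)


sample = [
    "i 32 mangol puri delhi",
    "b-8/205 rohini delhi",
    "kormangalam bengaluru",
]

distinct, coverage = vocab_coverage(sample, max_vocab=3)
print(distinct, round(coverage, 2))
```

If the pipeline is built on Hugging Face transformers, the usual way to grow a vocabulary is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Whether fast-bert's LM fine-tuning exposes this directly is not confirmed in this thread.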

Sagar1094 avatar May 29 '20 01:05 Sagar1094

@Sagar1094 can you please share the code that you are using for LM fine-tuning? Thanks

krannnn avatar Jun 10 '20 11:06 krannnn

Hi, I followed the tutorial for LM fine-tuning. Regards, Sagar

Sagar1094 avatar Jun 10 '20 11:06 Sagar1094

Hi, my data is a little different. I have Indian addresses, for example "i 32 mangol puri delhi", "b-8/205 rohini delhi", "kormangalam bengaluru". I want to create an address classifier. These addresses also have labels associated with them, like "26-0", "23-2".

With pre-trained BERT, I think it is impossible to train on this kind of data, as most of the words would be out of vocabulary. Can you please help me and suggest an alternative approach? I have tried training BERT, ELECTRA, and RoBERTa models from scratch with a huge vocabulary (2,800,000 words), but it is failing. So I tried fine-tuning with fast-bert, which also doesn't work.
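On the out-of-vocabulary concern: BERT's WordPiece tokenizer splits an unseen word into known subword pieces by greedy longest-match, and falls back to `[UNK]` only when no piece matches. The sketch below illustrates that algorithm in plain Python. `toy_vocab` is a made-up vocabulary for illustration, not the real BERT vocabulary, and the lowercasing and punctuation splitting that the real tokenizer applies first are omitted.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry a ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to make the example readable.
toy_vocab = {"ro", "##hini", "del", "##hi", "bengal", "##uru"}

for w in ["rohini", "delhi", "bengaluru", "xyz"]:
    print(w, wordpiece(w, toy_vocab))
```

With a real pre-trained vocabulary, the pieces for locality names would likely be shorter and less meaningful than in this toy case, so this shows only that such words are not dropped outright, not that the resulting representation suits address classification.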

Please help 🙏😊 Regards, Sagar Gupta

Sagar1094 avatar Jun 11 '20 02:06 Sagar1094