
Has anyone predigested a large training set?

Open ninjamoba opened this issue 4 years ago • 2 comments

Has anyone tried training with The Pile: https://pile.eleuther.ai/

Perhaps we could start a shared library of models pretrained on this set as a general starting point?

Any suggestions about the best way to use the set above? Would it be performant, and could it scale to GPT-3 scope?

Or does this defeat the intended purpose of this repository as a "light" NLP library?

ninjamoba avatar Feb 08 '21 14:02 ninjamoba

NLP.js is a set of libraries for doing NLP in JavaScript, mainly intended for building conversational AI. You can also do many more general things with NLP.js: normalize, tokenize, stem, calculate frequencies, n-grams, and so on. But it is clearly not GPT-3: the GPT-3 training cost is around $4,600,000 (https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/)

My poor laptop does not even have enough disk space for these 800GB of data, and I can't even imagine how to handle that amount of data in terms of memory. So when working with this much data, the infrastructure cost is something to take into account. So I'm sorry, but I will not even try :(

jesus-seijas-sp avatar Feb 08 '21 14:02 jesus-seijas-sp

OK, so maybe not on your laptop ;) You and NLP.js are such a beacon of hope. If this is only a horsepower issue, I think we can figure out a way to get some of these libraries digested, even just to experiment. You know GPT-3 is trained on really dirty data, and these training libraries seem legit. From your response I can see it's not a bad project, and we can eat the elephant bite by bite. :)
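
To make the bite-by-bite idea a little more concrete, here is a minimal sketch in plain JavaScript (it does not use the NLP.js API) that streams a corpus file line by line and keeps only running token counts in memory, never the whole file. It assumes the data is newline-delimited JSON with a `text` field per record and that any compressed file has already been unpacked; the names `countTokens` and `countFile` are made up for this example.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper: count token frequencies from any sync or async
// iterable of JSONL lines. Only the running counts are held in memory.
async function countTokens(lines, counts = new Map()) {
  for await (const line of lines) {
    if (!line.trim()) continue; // skip blank lines
    let doc;
    try {
      doc = JSON.parse(line);
    } catch {
      continue; // skip malformed records
    }
    // Assumption: each record stores its content in a string "text" field.
    if (!doc || typeof doc.text !== 'string') continue;
    // Very rough tokenizer: lowercase, then runs of letters or digits.
    const tokens = doc.text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
    for (const token of tokens) {
      counts.set(token, (counts.get(token) || 0) + 1);
    }
  }
  return counts;
}

// Hypothetical wrapper: stream one file from disk, one line at a time.
// Pass the same Map again to accumulate counts across several files.
function countFile(path, counts) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  return countTokens(rl, counts);
}
```

Passing the same `Map` to `countFile` for each file accumulates counts across the set one piece at a time. Memory then grows with the vocabulary rather than with the corpus size, though disk space and processing time remain real costs at 800GB.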

ninjamoba avatar Feb 08 '21 19:02 ninjamoba

Closing due to inactivity

aigloss avatar Nov 24 '22 12:11 aigloss