RedPajama-Data

Where is the FastText pretrained model to classify each CommonCrawl webpage?

Open yuhai-china opened this issue 1 year ago • 2 comments

First of all: thank you very much for your contribution!

I would be grateful if you could share the pretrained FastText model that classifies whether each CommonCrawl webpage is a low-quality page.

yuhai-china, Apr 24 '23 03:04

You can download it here: https://fasttext.cc/docs/en/language-identification.html

xzyaoi, Apr 25 '23 14:04

Thanks, but I am looking for the model that classifies whether a web page is low quality, not the language identification model.

yuhai-china, Apr 26 '23 02:04

Hi @yuhai-china, here's the model file that we trained (link). You can use the script here to load the model and run inference.

I will also add a link to the model weights to our README.
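
For readers without the linked script at hand, loading a FastText classifier and scoring a page could look roughly like the sketch below. This is not the repository's script: the helper names, the file name `model.bin`, and the label `__label__hq` are placeholders assumed for illustration, so check the real label names with `model.get_labels()` before relying on the score.

```python
def normalize(text: str) -> str:
    """Collapse all whitespace so the page becomes a single line.

    fastText's predict() raises an error on input containing newlines.
    """
    return " ".join(text.split())


def quality_score(model, text: str, positive_label: str = "__label__hq") -> float:
    """Return the probability the model assigns to `positive_label`.

    `positive_label` is a hypothetical name; the actual labels depend on
    how the classifier was trained.
    """
    # k=-1 asks fastText for every label, so the positive one is always present
    # if it exists in the model.
    labels, probs = model.predict(normalize(text), k=-1)
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0  # label not known to this model


def load_quality_model(model_path: str):
    """Load a trained fastText model from disk (requires `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(model_path)


# Example usage (assumes the downloaded weights were saved as "model.bin"):
#   model = load_quality_model("model.bin")
#   print(model.get_labels())
#   print(quality_score(model, "Some page text\nspread over lines"))
```

A page can then be kept or dropped by comparing the score against a threshold of your choosing.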

Ivan-Zhou, Apr 29 '23 21:04

Thank you very much. It works well.

yuhai-china, Apr 30 '23 08:04