Question about Chinese Language Support and Model Retraining
❓ The question
I really appreciate the team's work in sharing this model for research and learning purposes. I have a question regarding its Chinese language capabilities: the model appears to be poorly optimized for Chinese inputs. Could you clarify:
- What percentage of the pretraining corpus consists of Chinese data?
- If I want to train a Chinese-optimized LLM based on this model, what approach would you recommend? Thank you for your guidance.
Hi there! Thanks for the kind words and the inquiry!
Our pretraining data was intentionally scoped to English only: we used FastText classifiers to remove non-English text, so there will be very little Chinese in the data (only what slipped past the classifier threshold). These models therefore likely won't be great for your purposes.
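For anyone curious, here is a minimal sketch of how this kind of FastText language filter typically works. The `lid.176.bin` model and the 0.9 threshold are illustrative assumptions, not necessarily what was used in this model's actual data pipeline:

```python
# Illustrative FastText language-ID filter (not the exact production pipeline).
# lid.176.bin is fastText's public language-identification model:
# https://fasttext.cc/docs/en/language-identification.html
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.9) -> bool:
    """Keep a document only if fastText predicts English above the threshold."""
    # fastText's predict() processes one line at a time, so strip newlines first.
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

docs = ["The quick brown fox jumps over the lazy dog.", "你好，世界。"]
english_only = [d for d in docs if is_english(d)]  # drops the Chinese doc
```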
@baileykuehl Hi! Thank you very much for your prompt response!
Regarding adapting this model for Chinese, I imagine it would require significant effort. For instance:
- **Vocabulary expansion:** We'd likely need to significantly expand the current CL100K vocabulary to incorporate Chinese tokens, and then reinitialize and retrain the embedding matrices (see the tokenizer sketch after this list).
- **Extensive pre-training:** We'd need to conduct large-scale pre-training from scratch using vast amounts of Chinese text, followed by continued pre-training on high-quality Chinese data (a minimal continued-pretraining sketch also follows below).
- **My understanding:** Since this model is optimized specifically for English and doesn't inherently support multilingual capabilities, this kind of overhaul would be necessary. Is that correct?
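To make the vocabulary-expansion step concrete, here is a hypothetical sketch using Hugging Face `transformers`. The model name is a placeholder, the two added tokens stand in for a full merged Chinese vocabulary, and the mean-initialization of the new embedding rows is one common heuristic rather than a required step:

```python
# Hypothetical vocabulary expansion; "your-base-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# In practice you'd train a Chinese tokenizer and merge thousands of tokens;
# two tokens here keep the example readable.
new_tokens = ["你好", "世界"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix; new rows are randomly initialized by default.
model.resize_token_embeddings(len(tokenizer))

# One common heuristic: set the new rows to the mean of the existing
# embeddings, which tends to stabilize early continued pretraining.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-num_added:] = emb[:-num_added].mean(dim=0, keepdim=True)
```

If the input and output embeddings aren't tied, the new rows of the LM head would want the same re-initialization.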
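And for the continued-pretraining step, a minimal sketch with the Hugging Face `Trainer`, continuing from the `model` and `tokenizer` above. The corpus file, sequence length, and hyperparameters are illustrative assumptions only:

```python
# Minimal continued-pretraining loop; all hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# chinese_corpus.txt is a placeholder for your actual Chinese data.
dataset = load_dataset("text", data_files={"train": "chinese_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,  # the resized model from the previous sketch
    args=TrainingArguments(
        output_dir="ckpts",
        per_device_train_batch_size=4,
        learning_rate=2e-5,  # lower than from-scratch LRs, typical for continued pretraining
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard causal-LM (next-token prediction) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```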