How to extend the window size during training in step 2?
https://github.com/deepseek-ai/DeepSeek-Coder#model-training describes "Further Pre-training using an extended 16K window size on an additional 200B tokens". How do I extend the window size during training? Is it enough to modify max_length and max_position_embeddings in the "config.json" file, or is something else needed?
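For reference, a minimal sketch of the kind of config.json change being asked about, assuming the standard Hugging Face LLaMA-style config keys (`max_position_embeddings`, `rope_theta`, `rope_scaling`) and the linear RoPE scaling with factor 4 and base frequency 100000 described in the DeepSeek-Coder technical report; this is an illustration of the config format, not a confirmed recipe from the authors:

```python
import json

# Hypothetical starting config (4K window, default RoPE base).
config = {
    "max_position_embeddings": 4096,
    "rope_theta": 10000,
}

# Extend 4K -> 16K before long-context continued pre-training:
# raise max_position_embeddings, set linear RoPE scaling with
# factor 4, and raise the RoPE base frequency to 100000
# (values taken from the technical report).
config["max_position_embeddings"] = 16384
config["rope_theta"] = 100000
config["rope_scaling"] = {"type": "linear", "factor": 4.0}

print(json.dumps(config, indent=2))
```

Note that changing the config only alters how positions are encoded at load time; the extended window still has to be trained on long sequences for the model to use it well.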
Please check our technical report: https://arxiv.org/pdf/2401.14196.pdf
The technical report says that only 8B tokens were used for long-context training, while the README says 200B tokens were trained with the 16K window, so I am a little confused. If pre-training really did use 200B tokens for 16K-window training, when were the RoPE base frequency and scaling factor modified? Thank you.
Content in the technical report: "The model underwent an additional 1000 steps of training, using a batch size of 512 and a sequence length of 16K."
@guoday