CodeGen2
How many tokens were used for training?
Curious to know how many tokens the models have seen. The repo mentions the dataset, but not the totals.
"This checkpoint is trained on the stricter permissive subset of the deduplicated version of the Stack dataset (v1.1). Supported languages (and frameworks) are as follows: c, c++, c-sharp, dart, go, java, javascript, kotlin, lua, php, python, ruby, rust, scala, shell, sql, swift, typescript, vue."
Thanks!