CodeGen2

How many tokens were used for training?

Open edward-io opened this issue 1 year ago • 0 comments

I'm curious how many tokens the models saw during training. The repo describes the dataset, but doesn't give the token totals:

This checkpoint is trained on the stricter permissive subset of the deduplicated version of the Stack dataset (v1.1). Supported languages (and frameworks) are as follows: c, c++, c-sharp, dart, go, java, javascript, kotlin, lua, php, python, ruby, rust, scala, shell, sql, swift, typescript, vue.

Thanks!

edward-io, May 04 '23 10:05