CodeGen2 How many tokens were used for training?

Open edward-io opened this issue 1 year ago • 0 comments

I'm curious how many tokens the models saw during training. The repo mentions the dataset, but not the total token count:

"This checkpoint is trained on the stricter permissive subset of the deduplicated version of the Stack dataset (v1.1). Supported languages (and frameworks) are as follows: c, c++, c-sharp, dart, go, java, javascript, kotlin, lua, php, python, ruby, rust, scala, shell, sql, swift, typescript, vue."

Thanks!

May 04 '23 10:05 edward-io
