
How is the amount of training data measured?

Open · WentaoChen0813 opened this issue 2 years ago • 1 comment

Thanks for open-sourcing this excellent work! I have a question about the amount of training data reported in your tech report and would appreciate a reply. Table 1 lists a total of 797.92 GB of training data. Is this size measured as the number of bytes of raw text, or as the size of the saved files? Some file formats may compress raw text significantly.

WentaoChen0813, Feb 29 '24 02:02

It is measured with the len(code) function, applied to the code text.

guoday, Mar 12 '24 02:03
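To illustrate the distinction raised in the question, here is a minimal sketch (not the maintainers' code) comparing three ways of sizing the same source text. It assumes `code` is a Python `str`, in which case `len(code)` counts Unicode characters rather than bytes; the helper name `measure_sizes` and the sample snippet are hypothetical, and gzip stands in for whatever compression a storage format might apply.

```python
import gzip


def measure_sizes(code: str) -> dict:
    """Return three size measures for one piece of source text."""
    raw_bytes = code.encode("utf-8")
    return {
        # What len(code) reports for a str: number of Unicode characters.
        "chars": len(code),
        # Raw text size in bytes when encoded as UTF-8.
        "utf8_bytes": len(raw_bytes),
        # Size after compression, as a stand-in for a compressed file on disk.
        "gzip_bytes": len(gzip.compress(raw_bytes)),
    }


# Hypothetical sample: a repetitive snippet with one non-ASCII character per copy.
sample = "def greet():\n    return 'héllo'\n" * 100
sizes = measure_sizes(sample)
print(sizes)
```

For pure ASCII code the character count and the UTF-8 byte count coincide, so the two only diverge on files with non-ASCII comments or string literals. The compressed size is typically much smaller and depends on the storage format, which is why it is a poor measure of dataset size.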