DeepSeek-Coder
How is the amount of training data measured?
Thanks for open-sourcing this excellent work! I have a question about the amount of training data in your tech report and would appreciate a reply. Table 1 lists a total of 797.92 GB of training data. Is that size measured as the number of bytes of raw text, or as the size of the saved files? Some file formats may compress raw text significantly.
It is measured with the len(code) function.
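For anyone comparing the two measurements raised in the question, here is a minimal sketch of how they can differ. It assumes `code` is a Python `str` holding one file's text, which the reply does not state; the `measure` helper, the sample string, and the choice of gzip are hypothetical and only for illustration. On a `str`, `len` counts Unicode code points, so it matches the UTF-8 byte count only for pure-ASCII text.

```python
import gzip


def measure(code: str) -> dict:
    """Return three size measurements for one source file's text.

    Hypothetical helper for illustration; not taken from the DeepSeek-Coder pipeline.
    """
    raw = code.encode("utf-8")  # raw text as UTF-8 bytes
    return {
        "chars": len(code),                     # what len(code) gives for a str
        "utf8_bytes": len(raw),                 # bytes of raw text
        "gzip_bytes": len(gzip.compress(raw)),  # size if stored gzip-compressed
    }


# Sample text with one non-ASCII character ("é" takes 2 bytes in UTF-8).
sample = "# café\n" + "print('hello')\n" * 200
sizes = measure(sample)
print(sizes)
```

Under these assumptions, the character count and the UTF-8 byte count differ by one for this sample, while the compressed size is far smaller than either, so a total based on `len(code)` would not depend on how the files were stored on disk.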