Hima Patel
Thank you for your response. Is this done for all languages in the data?
Thank you @guoday
@guoday Do you then do [repo level dedup](https://github.com/deepseek-ai/DeepSeek-Coder/issues/42#issuecomment-1825826360) for all programming languages or just the above languages?
@guoday Thank you for your prompt responses. I was curious whether you ran any ablation studies or evaluations to understand if repo-level concatenation significantly improved model performance.
Do you have your own repo-level benchmark, or do you use a standard one?
Ok thanks, was aware of those. Once again, appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!
@guoday I was also wondering what you do with the other files, such as build files or metadata files. Thanks
@guoday Thanks for your response. So if I understand right, you employ fuzzy dedup at repo level. Is that correct?
@guoday Can you share some details on the model architecture that you used for this work?
Thank you @guoday