Hima Patel comments

Results: 46 comments of Hima Patel

Repo level concatenation of data

Thank you for your response. Is this done for all languages in the data?

Repo level concatenation of data

Thank you @guoday

Repo level concatenation of data

@guoday Do you then do [repo level dedup](https://github.com/deepseek-ai/DeepSeek-Coder/issues/42#issuecomment-1825826360) for all programming languages or just the above languages?

Repo level concatenation of data

@guoday Thank you for your prompt responses. I was curious whether you ran any ablation studies or evaluations to understand whether repo level concatenation improved model performance significantly.

Repo level concatenation of data

Do you have your own repo level benchmark or use a standard one?

Repo level concatenation of data

Ok thanks, was aware of those. Once again, appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!

Repo level concatenation of data

@guoday I was also wondering what you do with the other files, such as build files or metadata files. Thanks

Dedup of code during data prep

@guoday Thanks for your response. So if I understand right, you employ fuzzy dedup at repo level. Is that correct?

Dedup of code during data prep

@guoday Can you share some details on the model architecture that you used for this work?

Dedup of code during data prep

Thank you @guoday