Questions Regarding CoIR Dataset Usage in Code Explanation Retrieval

Open vaishnavirshah opened this issue 8 months ago • 0 comments

I’ve been referring to the CoIR paper and codebase—thank you for making this valuable resource available!

I have a question regarding dataset handling in your work.

For datasets retrieved via Hugging Face (like CodeSearchNet), is any preprocessing (e.g., stripping comments from code) applied before retrieval? I couldn't find related scripts in the repo.

I noticed that comments from the code sometimes appear as queries in all splits (train, valid, test). For example:

corpus:

queries:

qrel:
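
A minimal sketch of how this kind of overlap could be checked, assuming the corpus and queries are available as plain `id -> text` dictionaries and the qrels as `(query_id, doc_id)` pairs. The helper names `normalize` and `find_leaked_pairs` and the toy records are hypothetical and are not taken from the CoIR codebase or from CodeSearchNet; the real data would have to be loaded and mapped into this shape first.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaked_pairs(corpus, queries, qrels):
    """Return (query_id, doc_id) pairs whose query text occurs verbatim in the linked document."""
    leaked = []
    for query_id, doc_id in qrels:
        query = normalize(queries[query_id])
        doc = normalize(corpus[doc_id])
        # Skip empty queries: an empty string is trivially a substring of anything.
        if query and query in doc:
            leaked.append((query_id, doc_id))
    return leaked


# Hypothetical toy records for illustration only.
corpus = {
    "d1": "def add(a, b):\n    # Return the sum of two numbers.\n    return a + b",
    "d2": "def sub(a, b):\n    return a - b",
}
queries = {
    "q1": "Return the sum of two numbers.",
    "q2": "Subtract b from a.",
}
qrels = [("q1", "d1"), ("q2", "d2")]

leaked = find_leaked_pairs(corpus, queries, qrels)
print(leaked)
```

Running the same check separately on the train, valid, and test splits would show how often a query is simply a comment copied from its relevant code document. Exact substring matching is a deliberately conservative choice here; paraphrased or partially overlapping comments would need a fuzzier comparison.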

Apr 07 '25 23:04 vaishnavirshah