deep-code-search
Dataset
Hi guys, awesome project. Would you mind releasing the original training and testing dataset, without any pickled or preprocessed files?
@0bserver07 @guxd could you share the raw method names/descriptions?
@matanpugach That's what I meant!
I already downloaded the files from Gdrive, but they are preprocessed. Everyone using your dataset is limited to your features (token, api sequence, name tokens) and a vocabulary limit of 10,000. One cannot restore the original "RawCode" to "Documentation" mapping from your dataset in order to, for example, try new features.
Hi @villmow, I have the same requirement. Have you got the raw datasets without preprocessing from @guxd?
I have raw Python data. How do I preprocess it the same way you did for Java? I would like to build code search for Python code.
@gauravkoradiya You should use a Python code parser. Python provides an ast module which supports the parsing.
Here is a sample project which parses Python code into ASTs:
https://github.com/fyrestone/pycode_similar
You can find more on GitHub.
After that, you need to convert the ASTs into call sequences.
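For what it's worth, here is a minimal sketch of that idea using only the standard-library ast module. It is not the authors' preprocessing pipeline: the helper names (CallCollector, extract_methods), the record layout, and the choice to keep only the bare callee name are assumptions for illustration. It pulls out, per function, the name tokens, the docstring, and the sequence of called names.

```python
import ast


class CallCollector(ast.NodeVisitor):
    """Collects callee names in post-order, so inner calls come before outer ones."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        # Visit the receiver and arguments first, then record this call.
        self.generic_visit(node)
        func = node.func
        if isinstance(func, ast.Name):
            self.calls.append(func.id)        # plain call: open(...)
        elif isinstance(func, ast.Attribute):
            self.calls.append(func.attr)      # method call: handle.read()


def extract_methods(source):
    """Return one record per function definition found in `source`.

    Hypothetical record layout (an assumption, not the dataset's format):
    name, name_tokens, doc, calls. Calls inside nested functions are
    also counted for the enclosing function.
    """
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            collector = CallCollector()
            collector.visit(node)
            records.append({
                "name": node.name,
                "name_tokens": [t for t in node.name.split("_") if t],
                "doc": ast.get_docstring(node),
                "calls": collector.calls,
            })
    return records


SAMPLE = '''
def read_text(path):
    """Read a file and return its text without surrounding whitespace."""
    handle = open(path)
    text = handle.read()
    handle.close()
    return text.strip()
'''

records = extract_methods(SAMPLE)
for record in records:
    print(record)
```

Splitting names on underscores only handles snake_case; camelCase names would need an extra splitting rule. Vocabulary building and token indexing would still have to be done separately on top of these records.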
Awesome, thank you. I got it.
Could you share the original datasets without any pickled or preprocessed files?
I agree with @hoogang. @guxd, would you please release your raw or original datasets without any pickled or preprocessed files? Thank you.
It's a pity the authors have not released the original dataset.
The raw code datasets are available at /pytorch/train.rawcode.rar