Dataset

Open villmow opened this issue 6 years ago • 12 comments

Hi guys, awesome project. Would you mind releasing the original training and testing dataset, without any pickled or preprocessed files?

villmow avatar Aug 06 '18 12:08 villmow

@villmow The files can be downloaded from here on Gdrive or from the Baidu URL.

0bserver07 avatar Dec 04 '18 00:12 0bserver07

@0bserver07 @guxd Could you share the raw method names/descriptions?

matanpugach avatar Dec 04 '18 07:12 matanpugach

@matanpugach That's what I meant!

I already downloaded the files from Gdrive, but they are preprocessed. Everyone using your dataset is limited to your features (token, api sequence, name tokens) and a vocabulary limit of 10,000. One cannot restore the original "RawCode" - "Documentation" mapping from your dataset, to - for example - try new features.

villmow avatar Dec 04 '18 08:12 villmow

Hi @villmow, I have the same requirement. Have you gotten the raw datasets without preprocessing from @guxd?

wanyao1992 avatar Dec 17 '18 18:12 wanyao1992

I have raw Python data. How can I preprocess it the same way you did for Java? I would like to build code search for Python code.

gauravkoradiya avatar Apr 02 '19 12:04 gauravkoradiya

@gauravkoradiya You should use a Python code parser. Python provides an ast module that supports parsing.
Here is a sample project that parses Python code into ASTs: https://github.com/fyrestone/pycode_similar You may find more on GitHub. After that, you need to convert the ASTs into call sequences.
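
A minimal sketch of that step, using only the standard-library ast module. It is not the preprocessing used for the Java dataset: the helper names (split_name, callee_name, extract_functions) and the record keys are hypothetical, and "call sequence" is taken here to mean the calls in a function body ordered by source position.

```python
import ast
import re


def split_name(name):
    """Split a snake_case or camelCase identifier into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return spaced.lower().split()


def callee_name(func):
    """Return a dotted name for a call target, or None if it has no plain name."""
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        base = callee_name(func.value)
        return f"{base}.{func.attr}" if base else func.attr
    return None


def extract_functions(source):
    """Return one record per function: name tokens, call sequence, docstring."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # ast.walk is breadth-first, so sort the calls by source position.
        calls = sorted(
            (n for n in ast.walk(node) if isinstance(n, ast.Call)),
            key=lambda n: (n.lineno, n.col_offset),
        )
        names = [callee_name(c.func) for c in calls]
        records.append({
            "name_tokens": split_name(node.name),
            "api_sequence": [n for n in names if n],
            "description": ast.get_docstring(node) or "",
        })
    return records


sample = '''
def read_lines(path):
    """Read all lines from a file."""
    f = open(path)
    lines = f.readlines()
    f.close()
    return lines
'''

records = extract_functions(sample)
print(records[0])
```

Calls are named by the variable they are invoked on (f.readlines), not by the receiver's type. Resolving types would need extra analysis that this sketch does not attempt.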

guxd avatar Apr 02 '19 12:04 guxd

Awesome, thank you. I got it.

gauravkoradiya avatar Apr 19 '19 11:04 gauravkoradiya

Could you share the original datasets without any pickled or preprocessed files?

hoogang avatar May 21 '19 05:05 hoogang

I agree with @hoogang. @guxd, would you please release your raw or original datasets without any pickled or preprocessed files? Thank you.

jackalhan avatar Sep 24 '19 05:09 jackalhan

It's a pity the authors have not released the original dataset.

LeeSureman avatar Oct 11 '21 09:10 LeeSureman

The raw code datasets are available at /pytorch/train.rawcode.rar

guxd avatar Jun 06 '23 03:06 guxd