Encountered error of preprocess data

Open yingdehuijin opened this issue 3 years ago • 8 comments
Hi Uri, I am using code2seq on the newest EMSE-DeepCom datasets (https://github.com/xing-hu/EMSE-DeepCom). I followed your suggestions and ran the preprocess.sh script, but I encountered errors on the test/val/train datasets. The error_log.txt and stdout show the following information: b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: ">" ">"\n at line 2, column 407.\n\nWas expecting one of:\n\n

The number of examples also decreased: 20000 test methods decreased to 17060, 20000 validation methods decreased to 17043, and 480000 training methods decreased to 380001. Is there something wrong with the datasets? Looking forward to your reply! Wcc

yingdehuijin avatar Jun 30 '22 16:06 yingdehuijin

Hi @yingdehuijin , Thank you for your interest in our work!

I don't know if there is anything wrong with this dataset, I have never used it.

However, it does seem like the files there will not parse. Are they raw java files? Maybe they have a different format? Our preprocessing pipeline expects raw java files.

Can you provide a single example from the dataset?

Best, Uri

urialon avatar Jul 03 '22 02:07 urialon

Thank you for your reply. A single example from the dataset looks like this:

code: public static DecomposableMatchBuilder1 < Float , Float > caseFloat ( MatchesAny f ) { List < Matcher < Object > > matchers = new ArrayList < > ( ) ; matchers . add ( any ( ) ) ; return new DecomposableMatchBuilder1 < > ( matchers , NUM_ , new PrimitiveFieldExtractor < > ( Float . class ) ) ; }

nl: matches a float .

yingdehuijin avatar Jul 03 '22 02:07 yingdehuijin

Is the "nl: matches a float ." part of the same file? Our JavaExtractor expects pure Java files and extracts the method names as the labels. You can replace the existing method name (caseFloat in your example) with a unique ID, remove the "nl: matches a float ." line, and later replace the unique ID in the processed files with the natural language sequence that you wish to generate.
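The ID-swap workaround described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not code from the code2seq repository: the helper names `make_id`, `rename_method`, and `restore_label` are hypothetical, the method name is assumed to be the first identifier followed by `(` in the snippet (which will not hold if the snippet starts with an annotation that takes arguments), and each processed line is assumed to start with the label followed by a space, with label subtokens joined by `|`. The IDs use lowercase letters only, on the assumption that the extractor may split or normalize names containing digits, underscores, or capitals.

```python
import re
import string

# Hypothetical helpers for illustration; none of these names exist in code2seq.

def make_id(index: int, width: int = 8) -> str:
    """Encode an integer as 'mid' plus `width` lowercase letters.

    Letters only: we assume digits, underscores, and capitals could be
    split or dropped when the extractor normalizes method names.
    """
    if not 0 <= index < 26 ** width:
        raise ValueError("index does not fit in the requested width")
    letters = []
    for _ in range(width):
        index, rem = divmod(index, 26)
        letters.append(string.ascii_lowercase[rem])
    return "mid" + "".join(reversed(letters))

def rename_method(code: str, new_name: str) -> str:
    """Replace the first identifier that is followed by '(' with new_name.

    Assumption: in a single-method snippet this identifier is the method name.
    """
    pattern = r"\b([A-Za-z_$][\w$]*)(\s*\()"
    return re.sub(pattern, lambda m: new_name + m.group(2), code, count=1)

def restore_label(line: str, id_to_nl: dict) -> str:
    """Swap a placeholder ID in the first field of a processed line for its
    natural-language target, joining the target tokens with '|' (assumed format).
    """
    label, sep, rest = line.partition(" ")
    if label in id_to_nl:
        label = "|".join(id_to_nl[label].split())
    return label + sep + rest

# Usage on a shortened version of the example from this thread.
code = "public static int caseFloat ( MatchesAny f ) { return NUM_ ; }"
nl = "matches a float ."
method_id = make_id(0)
java_source = rename_method(code, method_id)  # write this to a .java file
id_to_nl = {method_id: nl}                    # keep this mapping for later
```

After preprocessing, `restore_label` would be applied to every line of the generated files; lines whose first field is not a known ID are returned unchanged.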

See also: https://github.com/tech-srl/code2seq/issues/45

Best, Uri

urialon avatar Jul 14 '22 02:07 urialon

Hello, I encountered the same issue while preprocessing the files. Does the original JAR package handle exceptions, such as skipping files that do not meet the format requirements without preprocessing them? I'm using it to process my own dataset, but it's throwing errors. I'm not sure if it will keep getting stuck there.

lidiancracy avatar Sep 17 '23 13:09 lidiancracy

Hi @lidiancracy , Thank you for your interest in our work.

The truth is that I don't remember; this code was written about 5 years ago. If you wish to debug it, go ahead: the entire Java code is available in this repo.

But I recommend using newer models such as PolyCoder: https://github.com/VHellendoorn/Code-LMs https://arxiv.org/pdf/2202.13169.pdf

Best, Uri

urialon avatar Sep 17 '23 15:09 urialon

@urialon Thank you for your timely reply. My .sh file now terminates normally and has produced 4 files with the .c2s extension, so I think the logic in the JAR package is probably fine. By the way, can I continue training an already trained model on a new dataset, similar to transfer learning or incremental training? I did not find any relevant information in the README. Did I miss something? Thank you in advance.

lidiancracy avatar Sep 18 '23 01:09 lidiancracy

Sorry to bother you. I trained the model using the default parameters, but now only the dictionary remains, as shown in the picture. Is this normal? [image]

lidiancracy avatar Sep 19 '23 05:09 lidiancracy

Yes

urialon avatar Sep 19 '23 11:09 urialon