code-transformer
PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch
Hi, preprocessing runs without errors for some code snippets, but for others it throws the exception "PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch". Can anyone tell me how to get past this error?
The code being used for preprocessing is from the interactive_prediction.ipynb notebook provided.
preprocessor = CTStage1Preprocessor(code_snippet_language, allow_empty_methods=True)
stage1_sample = preprocessor.process([("", "", code_snippet)], 0)
Hi,
thanks for your interest in the Code Transformer.
This PreprocessingException just means that all code snippets in a preprocessing batch (which usually contains 10 snippets) were filtered out. A snippet is filtered out for one of the following reasons:
- the snippet is too long (in our experiments, we used a threshold of 10000 tokens)
- the snippet has no body (i.e., just a method definition) and allow_empty_methods = False
- an error happened for the snippet during tokenization, comment removal, empty-line removal, whitespace removal, string/number masking, etc. (You can see the list of preprocessing modules here: https://github.com/danielzuegner/code-transformer/blob/c7eb56e895cd70307cf4a69cb6c5d8495d17b469/code_transformer/preprocessing/pipeline/stage1.py#L44)
You can adapt the filtering behaviour with the preprocessing config, such as the one we used for preprocessing the CSN code snippets. It is normal for some snippets to be filtered out. However, if you suspect that snippets are being dropped that should be fine, it is probably because of a formatting issue. Do you have example snippets that cannot be preprocessed?
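To find out which snippets are affected, one option is to run each snippet as its own batch of one and record the ones that raise. The sketch below is only an illustration: `find_filtered_snippets` is a hypothetical helper, not part of the Code Transformer, and it assumes the `process(batch, batch_index)` call shape and the `("", "", code_snippet)` tuple from the question above. It catches a broad `Exception` because it makes no assumption about where `PreprocessingException` should be imported from.

```python
from typing import Callable, List, Tuple


def find_filtered_snippets(
    process_fn: Callable[[list, int], object],
    snippets: List[str],
) -> List[Tuple[int, str]]:
    """Run every snippet as a single-element batch and collect failures.

    process_fn is assumed to behave like preprocessor.process in the
    question: it takes a list of ("", "", code_snippet) tuples plus a
    batch index and raises if no snippet survives filtering.
    Returns (snippet_index, error_message) pairs for snippets that raised.
    """
    failed = []
    for idx, snippet in enumerate(snippets):
        try:
            # One snippet per batch, so an error points at exactly this snippet.
            process_fn([("", "", snippet)], idx)
        except Exception as exc:  # broad on purpose, see the note above
            failed.append((idx, str(exc)))
    return failed
```

With the preprocessor from the question, this would presumably be called as `find_filtered_snippets(preprocessor.process, my_snippets)`, where `my_snippets` is your own list of source strings. The snippets it reports are the ones worth checking against the filter reasons listed above, and the ones to post here as examples.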