comment_parser Suggest add an option to ignore special encoding characters

Open ghost opened this issue 5 years ago • 3 comments

Hi, this tool works well in many cases. But I found two problems.

Encoding problem

If a file contains characters in another encoding, e.g., Chinese characters and ½, an exception occurs in the extract_comments method.

I added errors='ignore' to the following open() call on my local machine; with it, the special characters above are skipped and the rest of the comment is still extracted.

def extract_comments(filename, mime=None):
    with open(filename, 'r', errors='ignore') as code:

So I think we could expose this as an option and let users decide whether to ignore such characters or not.
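
A minimal sketch of what such an option might look like. The helper name read_source and its errors parameter are hypothetical and are not part of comment_parser's API; the sketch only assumes that the value is passed straight through to Python's built-in open():

```python
import os
import tempfile


def read_source(filename, encoding='utf-8', errors='strict'):
    """Hypothetical helper: read a source file and let the caller choose the
    decoding error policy ('strict', 'ignore', 'replace', ...)."""
    with open(filename, 'r', encoding=encoding, errors=errors) as code:
        return code.read()


# Demo file containing the byte 0xBD ("½" in Latin-1), which is invalid UTF-8.
with tempfile.NamedTemporaryFile(suffix='.c', delete=False) as tmp:
    tmp.write(b'// Prints \xbd\nint x;\n')
    demo_path = tmp.name

try:
    read_source(demo_path)  # the default 'strict' policy raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors='ignore' the undecodable byte is silently dropped.
ignored = read_source(demo_path, errors='ignore')
os.remove(demo_path)
```

Keeping 'strict' as the default would preserve the current behavior, so only users who opt in would lose characters silently.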

Complex string

The tool throws an exception when parsing this Java file. I think the cause may be the complex string on line 99.

Thanks for your tool, it helps me a lot. I hope it keeps getting better~

Apr 13 '20 01:04 ghost

If a file contains characters in another encoding, e.g., Chinese characters and ½, an exception occurs in the extract_comments method.

Do you have an example to reproduce this? I played around with some Chinese characters and everything worked as it should, including the Server.java you linked.

The tool throws an exception when parsing this Java file. I think the cause may be the complex string on line 99.

Thanks for pointing this out, I fixed this yesterday in #26.

$ wget https://raw.githubusercontent.com/88250/symphony/master/src/main/java/org/b3log/symphony/Server.java
$ python3
>>> from comment_parser import comment_parser
>>> len(comment_parser.extract_comments('Server.java', 'text/x-java'))
7

Sep 04 '20 00:09 jeanralphaviles

#include <stdio.h>

int main(int argc, char** argv) {
  // Prints ½,
  printf("½\n");
  // Prints 你好，世界
  printf("你好，世界\n");
  return 0;
}

$ python3 -m comment_parser.comment_parser test.c
 Prints ½,
 Prints 你好，世界

Sep 04 '20 02:09 jeanralphaviles

I got a Unicode error a few weeks back when running it on Linux, but not on Windows. I tried encoding='utf-8' when opening a .java file, but that didn't solve the issue either.
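
One possible explanation, offered as an assumption rather than a diagnosis: when no encoding is passed, open() typically falls back to the platform's preferred encoding, which can differ between Linux and Windows, and a file saved in a legacy encoding such as GBK will still fail under encoding='utf-8'. A rough sketch of a fallback reader follows; read_with_fallback and its candidate list are hypothetical and not part of comment_parser:

```python
import locale
import os
import tempfile


def read_with_fallback(filename, encodings=('utf-8', 'gbk', 'latin-1')):
    """Hypothetical helper: try each candidate encoding in order and
    return (text, encoding_used)."""
    with open(filename, 'rb') as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # No candidate decoded cleanly: substitute U+FFFD for undecodable bytes.
    return raw.decode('utf-8', errors='replace'), 'utf-8'


# The encoding open() would normally use when none is given; platform dependent.
default_encoding = locale.getpreferredencoding(False)

# Demo: a comment saved as GBK, which is not valid UTF-8.
with tempfile.NamedTemporaryFile(suffix='.java', delete=False) as tmp:
    tmp.write('// 你好，世界\n'.encode('gbk'))
    demo_path = tmp.name

text, used = read_with_fallback(demo_path)
os.remove(demo_path)
```

Note that 'latin-1' accepts any byte sequence, so placing it last makes it a catch-all that never raises but may produce the wrong characters.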

Oct 11 '20 11:10 muneersyed156
