What exactly should the result of `\s*#[^"']+$` be?

Open anywo opened this issue 7 months ago • 1 comments

What would you like to share?

Code

import re
a = """
py_path = os.path.abspath(sys.argv[0])  # Absolute path of the current script  
str = '#FFFFFF'  # Color  
py_dir = os.path.dirname(py_path)  # Directory where the current script is located  
"""
pattern = '\\s*#[^"\']+$'
regex = re.compile(pattern, flags=re.MULTILINE) if isinstance(pattern, str) else pattern
print(re.sub(regex, "", a))

Expected behavior


py_path = os.path.abspath(sys.argv[0])
str = '#FFFFFF'
py_dir = os.path.dirname(py_path)

Actual behavior


py_path = os.path.abspath(sys.argv[0])
str = '#FFFFFF'

VS Code result (consistent with my understanding)

Js Result

Java Result

Additional information

No response

May 30 '25 09:05 anywo

`\s*#[^"']+?$` is correct
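A likely explanation for why the lazy quantifier helps: the class `[^"']` also matches newline characters, so the greedy `+` keeps consuming text across lines for as long as it meets no quote, and `$` then anchors at the last line end it can reach. The minimal sketch below compares both patterns on a shortened version of the sample from the issue; the names `SOURCE`, `GREEDY`, and `LAZY` are illustrative and not part of the original report.

```python
import re

# Shortened sample from the issue (no trailing spaces, to keep the output easy to read).
SOURCE = (
    "py_path = os.path.abspath(sys.argv[0])  # Absolute path\n"
    "str = '#FFFFFF'  # Color\n"
    "py_dir = os.path.dirname(py_path)  # Directory\n"
)

# Original pattern: greedy '+', and [^"'] is allowed to match '\n'.
GREEDY = re.compile(r"""\s*#[^"']+$""", flags=re.MULTILINE)
# Pattern from the follow-up comment: lazy '+?' stops at the first line end.
LAZY = re.compile(r"""\s*#[^"']+?$""", flags=re.MULTILINE)

greedy_result = GREEDY.sub("", SOURCE)
lazy_result = LAZY.sub("", SOURCE)

# Greedy: the match that starts at "# Color" runs through the whole third line.
print(repr(greedy_result))
# Lazy: each comment is removed separately and all three code lines survive.
print(repr(lazy_result))
```

Neither pattern touches the `#` inside `'#FFFFFF'`, because the closing quote prevents `[^"']+` from reaching a line end. A comment that itself contains a quote character would also be skipped by both patterns.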

May 30 '25 09:05 anywo

The regex pattern `\s*#[^"']+$` only removes comments that match a specific format. The real issue is that `[^"']` also matches newline characters, so when the greedy `+` does match, it can run past the end of the line and remove valid code on the following lines. A safer solution is to use the **tokenize** module, which correctly distinguishes between code and actual comments:

import io
import tokenize

a = '''
py_path = os.path.abspath(sys.argv[0])  # Absolute path of the current script
str = '#FFFFFF'  # Color
py_dir = os.path.dirname(py_path)  # Directory where the current script is located
'''

# Tokenize the source text and keep every token except comments.
tokens = tokenize.generate_tokens(io.StringIO(a).readline)
result = [(toknum, tokval) for toknum, tokval, *_ in tokens if toknum != tokenize.COMMENT]
print(tokenize.untokenize(result))

The tokenize module breaks the source code into tokens (keywords, strings, comments, etc.).

We then filter out only the COMMENT tokens, so everything else (like # inside strings) is preserved.

Finally, untokenize() rebuilds the code without the comments. Because only (type, string) pairs are passed in, the spacing between tokens may change, but all valid code and strings are preserved.
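When `untokenize()` receives only (type, string) pairs, it does not reproduce the original column positions, so the spacing between tokens can differ from the input. The sketch below is one possible variant, not the code from the reply: it passes the full token tuples so that positions are kept, then trims the padding left where each comment used to be. The helper name `strip_comments` and the constant `SOURCE` are hypothetical.

```python
import io
import tokenize

# Shortened sample from the issue.
SOURCE = (
    "py_path = os.path.abspath(sys.argv[0])  # Absolute path\n"
    "str = '#FFFFFF'  # Color\n"
    "py_dir = os.path.dirname(py_path)  # Directory\n"
)


def strip_comments(source: str) -> str:
    """Return `source` with comment tokens removed (hypothetical helper)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # Keep the full token tuples so untokenize() can use the start/end positions.
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    rebuilt = tokenize.untokenize(kept)
    # untokenize() pads the gap left by each removed comment with spaces,
    # so strip trailing whitespace from every line.
    return "\n".join(line.rstrip() for line in rebuilt.split("\n"))


cleaned = strip_comments(SOURCE)
print(cleaned)
```

One limitation of this sketch: the per-line `rstrip()` also removes trailing whitespace inside multi-line string literals, so it is only safe when such whitespace does not matter.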

Jul 03 '25 20:07 pak4675