What exactly should the result of `\s*#[^"']+$` be?
What would you like to share?
Code
import re
a = """
py_path = os.path.abspath(sys.argv[0]) # Absolute path of the current script
str = '#FFFFFF' # Color
py_dir = os.path.dirname(py_path) # Directory where the current script is located
"""
pattern = '\\s*#[^"\']+$'
regex = re.compile(pattern, flags=re.MULTILINE) if isinstance(pattern, str) else pattern
print(re.sub(regex, "", a))
Expected behavior
py_path = os.path.abspath(sys.argv[0])
str = '#FFFFFF'
py_dir = os.path.dirname(py_path)
Actual behavior
py_path = os.path.abspath(sys.argv[0])
str = '#FFFFFF'
VS Code result (consistent with my understanding)
JS result
Java result
`\s*#[^"']+?$` is correct.
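A minimal sketch of why the lazy quantifier changes the result, using the sample from the question without its leading blank line. The names `greedy`, `lazy`, and `no_newline` are only illustrative. The `no_newline` variant, which excludes `\n` from the character class, is an alternative not proposed in this thread. Neither regex variant handles comments that themselves contain quote characters.

```python
import re

source = (
    "py_path = os.path.abspath(sys.argv[0]) # Absolute path of the current script\n"
    "str = '#FFFFFF' # Color\n"
    "py_dir = os.path.dirname(py_path) # Directory where the current script is located\n"
)

# Original pattern: [^"'] also matches "\n", so the greedy + can cross lines.
greedy = re.compile(r"""\s*#[^"']+$""", re.MULTILINE)
# Lazy quantifier: stops at the first position where $ matches.
lazy = re.compile(r"""\s*#[^"']+?$""", re.MULTILINE)
# Alternative (assumption, not from the thread): keep the match on one line.
no_newline = re.compile(r"""[ \t]*#[^"'\n]+$""", re.MULTILINE)

greedy_result = greedy.sub("", source)
lazy_result = lazy.sub("", source)
no_newline_result = no_newline.sub("", source)

print(greedy_result)
print(lazy_result)
print(no_newline_result)
```

On the second line the greedy match starts at ` # Color`, finds no quote character in the rest of the text, and therefore extends to the end of the string, which is why the third line disappears.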
The regex pattern **\s*#[^"']+$** only removes comments that match a specific format. The real issue is that `[^"']` also matches newlines, so when the greedy `+` finds no quote character it runs past the end of the line to the last position where `$` matches, swallowing any code on the following lines. A safer solution is to use the **tokenize** module, which correctly distinguishes between code and actual comments.
import tokenize
import io
a = '''
py_path = os.path.abspath(sys.argv[0]) # Absolute path of the current script
str = '#FFFFFF' # Color
py_dir = os.path.dirname(py_path) # Directory where the current script is located
'''
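# Tokenize the source; a '#' inside a string literal stays part of its STRING token.
tokens = tokenize.generate_tokens(io.StringIO(a).readline)
# Keep (type, string) pairs for every token except comments.
result = [(toknum, tokval) for toknum, tokval, *_ in tokens if toknum != tokenize.COMMENT]
# Rebuild source text from the remaining tokens (spacing may change in this 2-tuple mode).
print(tokenize.untokenize(result))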
The tokenize module breaks the source code into tokens (keywords, strings, comments, etc.).
We then filter out only the COMMENT tokens, so everything else (like # inside strings) is preserved.
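Finally, untokenize() rebuilds the code without the comments. Because only (type, string) pairs are passed in, the spacing between tokens may differ from the original, but the token contents, including strings, are unchanged.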
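If the original spacing needs to survive, one option is to use tokenize only to locate the comments and then cut them out of the original lines. This is a minimal sketch, not part of the answer above. `strip_comments` is a hypothetical helper name, and the code assumes the source is valid Python with `\n` line endings.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove comments while leaving the rest of each line untouched."""
    lines = source.split("\n")
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            # A comment always runs to the end of its line, so cut there
            # and drop the whitespace that preceded it.
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)


sample = (
    "py_path = os.path.abspath(sys.argv[0]) # Absolute path of the current script\n"
    "str = '#FFFFFF' # Color\n"
    "py_dir = os.path.dirname(py_path) # Directory where the current script is located\n"
)
cleaned = strip_comments(sample)
print(cleaned)
```

Because the cut positions come from COMMENT tokens, a `#` inside a string literal is never touched, and a comment containing quote characters is still removed.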