ASCII issue

Open 1stmud opened this issue 3 years ago • 1 comments

regex:[\u4e00-\u9fa5]{1,}==>

I want to remove Chinese characters, but this regex gives me an error:


Traceback (most recent call last):
  File "/usr/local/bin/git-filter-repo", line 4004, in <module>
    main()
  File "/usr/local/bin/git-filter-repo", line 3996, in main
    args = FilteringOptions.parse_args(sys.argv[1:])
  File "/usr/local/bin/git-filter-repo", line 2209, in parse_args
    args.replace_message = FilteringOptions.get_replace_text(args.replace_message)
  File "/usr/local/bin/git-filter-repo", line 2126, in get_replace_text
    replace_regexes.append((re.compile(regex), replacement))
  File "/Users/xxx/.pyenv/versions/3.9.13/lib/python3.9/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/Users/wudahai/.pyenv/versions/3.9.13/lib/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/xxx/.pyenv/versions/3.9.13/lib/python3.9/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/xxxx/.pyenv/versions/3.9.13/lib/python3.9/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/Users/xxx/.pyenv/versions/3.9.13/lib/python3.9/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/Users/xxx/.pyenv/versions/3.9.13/lib/python3.9/sre_parse.py", line 555, in _parse
    code1 = _class_escape(source, this)
  File "/Users/xxx/.pyenv/versions/3.9.13/lib/python3.9/sre_parse.py", line 350, in _class_escape
    raise source.error('bad escape %s' % escape, len(escape))
re.error: bad escape \u at position 1

1stmud avatar Jul 27 '22 07:07 1stmud

From https://docs.python.org/3/library/re.html:

'\u', '\U', and '\N' escape sequences are only recognized in Unicode patterns. In bytes patterns they are errors. Unknown escapes of ASCII letters are reserved for future use and treated as errors.

Everything in filter-repo is byte patterns, not strings. So you simply can't do this.

newren avatar Oct 09 '22 04:10 newren

So @newren, to be clear, we cannot modify commit author names with non-ASCII characters in them? Like many European names with diacritics, or names in non-Latin alphabets?

Atralb avatar Oct 17 '23 07:10 Atralb

> So @newren, to be clear, we cannot modify commit author names with non-ASCII characters in them? Like many European names with diacritics, or names in non-Latin alphabets?

I didn't say that at all.

I merely said that you cannot run invalid Python, and using '\u' in a bytes pattern is invalid Python.

Now, it may not be as easy, since you can't use some of the built-in facilities. But you could decode the bytestrings as UTF-8 (assuming you know all author names in your repo are valid UTF-8), run your regex on the resulting strings, and then encode the result back to raw bytes. You could also use other Python methods. But whatever you use needs to be valid Python for the given input data.
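A rough sketch of that decode, regex, encode round trip might look like the following. The helper name `strip_cjk` is made up for illustration, and it assumes the input really is valid UTF-8; `decode` raises `UnicodeDecodeError` otherwise.

```python
import re

# Compiled as a str pattern, so the \u escapes are valid here.
CJK = re.compile(r"[\u4e00-\u9fa5]+")

def strip_cjk(raw: bytes) -> bytes:
    """Remove characters in U+4E00..U+9FA5 from a UTF-8 bytestring.

    Hypothetical helper: assumes `raw` is valid UTF-8.
    """
    text = raw.decode("utf-8")     # bytes -> str
    cleaned = CJK.sub("", text)    # run the Unicode regex
    return cleaned.encode("utf-8") # str -> bytes again

result = strip_cjk("fix 修复 bug".encode("utf-8"))
print(result)
```

As far as I know, filter-repo's callback options such as `--message-callback` hand the data to your code as bytes (for example in a variable named `message`) and expect bytes back, so logic like this would go in the callback body rather than in a `--replace-message` expressions file.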

newren avatar Oct 17 '23 07:10 newren

Thanks for your prompt and helpful answer!

Atralb avatar Oct 17 '23 07:10 Atralb