clean-text URLs are not matched


Open lemon234071 opened this issue 3 years ago • 1 comment

import cleantext

text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
cleantext.replace_urls(text, "XXX")

Actual output:

郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈哈http://t.c

Expected:

郭麒麟打卡,且听他分享防疫小知识XXX哈哈XXX哈哈哈哈XXX

Apr 06 '21 12:04 lemon234071

Hey @lemon234071, thanks for reporting. I'm not sure how to handle this. Right now, a URL has to be separated from the surrounding tokens (e.g. by a preceding space) to be matched. In your string, the URLs could be detected by looking at the runs of ASCII characters. Maybe that could be the basis for special handling of Chinese texts? I would not adapt the current URL regex for English (etc.). https://github.com/jfilter/clean-text/blob/master/cleantext/constants.py#L62

Aug 26 '21 00:08 jfilter
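
The ASCII-based idea from the comment above could be sketched roughly as follows. This is not part of clean-text: the function name `replace_ascii_urls` and the pattern `ASCII_URL_RE` are hypothetical, and the pattern assumes that a URL starts with `http://`, `https://` (the colon may be missing, as in the second link of the report) or `www.` and runs until the first character that is not an ASCII URL character.

```python
import re

# Hypothetical pattern, not taken from clean-text: a scheme-like prefix
# followed by a run of ASCII characters that may appear in a URL.
# The colon is optional so that the malformed "http//t.cn/..." also matches.
ASCII_URL_RE = re.compile(
    r"(?:https?:?//|www\.)[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+"
)


def replace_ascii_urls(text, replace_with="<URL>"):
    """Replace URL-like ASCII runs, even when glued to non-ASCII text."""
    return ASCII_URL_RE.sub(replace_with, text)


text = (
    "郭麒麟打卡,且听他分享防疫小知识"
    "https://www.zhihu.com/qution/319823639哈哈"
    "http//t.cn/a67ov8bt哈哈哈http://t.c"
)
print(replace_ascii_urls(text, "XXX"))
```

Because the match ends only at the first non-URL character, ASCII punctuation directly after a link (a trailing `.` or `,`) would be swallowed as well, so this approach is probably only suitable as an opt-in for texts where URLs are embedded in non-ASCII scripts.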
