clean-text
URLs are not matched
import cleantext
text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
cleantext.replace_urls(text, "XXX")
output:
郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈哈http://t.c
Expected:
郭麒麟打卡,且听他分享防疫小知识XXX哈哈XXX哈哈哈哈XXX
Hey @lemon234071, thanks for reporting. I'm not sure how best to handle this. Right now, a URL has to be separated from the surrounding tokens somehow (e.g. by a preceding space). In your string, the URLs could be detected by looking for the runs of ASCII characters. Maybe it would be useful to add special handling for Chinese text? I would not adapt the current URL regex, which is meant for English (etc.). https://github.com/jfilter/clean-text/blob/master/cleantext/constants.py#L62
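To make the ASCII idea concrete, here is a minimal sketch of what such special handling might look like. It is not part of clean-text: the names `ASCII_URL_RE` and `replace_urls_ascii` are hypothetical, and the pattern is an assumption that simply treats a scheme followed by any run of printable, non-space ASCII characters as a URL. The colon after the scheme is made optional so that the `http//t.cn/...` case from the report is caught as well.

```python
import re

# Hypothetical pattern (an assumption, not clean-text's own regex):
# "http" or "https", an optional colon, "//", then a run of printable,
# non-space ASCII characters. No word boundary or leading space is required,
# so the match can start directly after a CJK character.
ASCII_URL_RE = re.compile(r"https?:?//[\x21-\x7e]+")


def replace_urls_ascii(text, replace_with="<URL>"):
    """Replace URL-like ASCII runs, even when no space precedes them."""
    return ASCII_URL_RE.sub(replace_with, text)


text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
print(replace_urls_ascii(text, "XXX"))
```

The trade-off is that the match ends only at the first non-ASCII character or whitespace, so ASCII punctuation directly after a URL would be swallowed too. That is probably acceptable for Chinese text but is one reason to keep it separate from the regex used for English.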