URLExtract
Doesn't check for valid termination
For the following input:
from urlextract import URLExtract
extractor = URLExtract()
text = """
http://httpbin.org/status/204, http://httpbin.org/status/204.
"""
urls = extractor.find_urls(text)
print(urls)
The output generated is:
['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']
The characters in the set [.,?!-] are not valid terminal symbols for a URL, so the extractor should check for them.
Hi, thank you for this note. I will think about it.
My first thought was that those characters can appear at the end of a URL if the URL has a query part.
http://example.com/status?bracket=[
Or am I wrong?
Maybe a valid URL should be encoded with % notation, but this human-readable form of a URL can be found in any text.
So this means adding more logic: check for these end characters only if the URL does not have a query part.
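The rule described above (strip trailing punctuation only when the URL has no query part) could be sketched as a post-processing step on the extracted strings. This is a minimal sketch using only the standard library; `strip_trailing_punctuation` and `TRAILING_CHARS` are hypothetical names and are not part of urlextract.

```python
from urllib.parse import urlsplit

# Characters reported in this issue as unlikely URL terminators.
TRAILING_CHARS = ".,?!-"


def strip_trailing_punctuation(url: str, chars: str = TRAILING_CHARS) -> str:
    """Strip trailing punctuation unless the URL carries a query part.

    Hypothetical helper: when a non-empty query is present, the trailing
    characters may belong to a query value, so the URL is left untouched.
    """
    if urlsplit(url).query:
        return url
    return url.rstrip(chars)


found = ["http://httpbin.org/status/204,", "http://httpbin.org/status/204."]
cleaned = [strip_trailing_punctuation(u) for u in found]
print(cleaned)
```

A URL such as http://example.com/status?bracket=[ passes through unchanged, because its query part is not empty.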
Oh, the [] were just meant to enclose the characters. The invalid symbols are ".,?!-" (quotes not included). Sorry for the misunderstanding. This is based on my own research, so it could be wrong :sweat_smile:
OK, thanks. I will look into all the issues you reported.
Hi @MacBox7,
sorry for such a long delay :(
Could you please help me with this issue? In particular, where is it defined that those symbols cannot be used as termination characters? I've read RFC 3986 and I don't think it is specified there. Maybe I missed something?
From the example above, I still cannot tell whether the ',' or '.' characters should or should not be part of the URL.
Thank you!
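For reference, the RFC 3986 character classes can be checked directly. The sketch below transcribes the `unreserved`, `sub-delims`, `pchar`, and `query` rules (ignoring percent-encoding) and reports where each disputed character is syntactically allowed. `allowed_in` is a hypothetical helper name used only for this illustration.

```python
import string

# RFC 3986 character classes (percent-encoded triplets are ignored here).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")  # characters allowed in a path segment
QUERY_CHARS = PCHAR | set("/?")              # characters allowed in a query


def allowed_in(char: str) -> list:
    """Return the URL components in which `char` may appear unencoded."""
    places = []
    if char in PCHAR:
        places.append("path")
    if char in QUERY_CHARS:
        places.append("query")
    return places


report = {c: allowed_in(c) for c in ".,?!-"}
for char, places in report.items():
    print(repr(char), places)
```

By this grammar, each of the five characters can legally end a URL, so stripping them is a heuristic about the surrounding prose and not a validity check.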
In the meantime, one can hack around this (at least for commas) via:
u = URLExtract()
u._stop_chars_right |= {','}
u._stop_chars_left |= {','}
Perhaps a sensible default would be to treat unconventional special characters as forbidden in a URL, and to add a nicer constructor argument so that anyone who really wants them in a URL can configure that?
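One possible shape for such a configurable option is sketched below as a thin wrapper around any URL-finding callable. `CleaningExtractor` and its `forbidden_trailing` argument are hypothetical and do not exist in urlextract; the stand-in finder simply splits on whitespace so the example runs without the library installed.

```python
from typing import Callable, Iterable, List


class CleaningExtractor:
    """Hypothetical wrapper that strips configurable trailing characters."""

    def __init__(
        self,
        find_urls: Callable[[str], Iterable[str]],
        forbidden_trailing: str = ".,?!-",
    ) -> None:
        self._find_urls = find_urls
        self._forbidden = forbidden_trailing

    def find_urls(self, text: str) -> List[str]:
        # str.rstrip with an empty string removes nothing, so passing
        # forbidden_trailing="" leaves the found URLs untouched.
        return [url.rstrip(self._forbidden) for url in self._find_urls(text)]


def fake_finder(text: str) -> List[str]:
    """Stand-in for a real extractor: treats every whitespace token as a URL."""
    return text.split()


text = "http://httpbin.org/status/204, http://httpbin.org/status/204."
strict = CleaningExtractor(fake_finder)
lenient = CleaningExtractor(fake_finder, forbidden_trailing="")
print(strict.find_urls(text))
print(lenient.find_urls(text))
```

With the real library, the bound method `URLExtract().find_urls` could presumably be passed in place of `fake_finder`.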