
Doesn't check for valid termination

Open ankitxjoshi opened this issue 6 years ago • 5 comments

For the following input:

    from urlextract import URLExtract

    extractor = URLExtract()
    text = """
    http://httpbin.org/status/204, http://httpbin.org/status/204.
    """
    urls = extractor.find_urls(text)
    print(urls)

The output generated is: ['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']

The characters in the set [.,?!-] are not valid terminal symbols for a URL, so the extractor should check for them.

ankitxjoshi avatar Apr 09 '18 05:04 ankitxjoshi

Hi, thank you for this note. I will think about it. My first thought is that those characters can appear at the end of a URL if the URL has a query part, e.g. http://example.com/status?bracket=[ Or am I wrong? Maybe a valid URL should be encoded with % notation, but this human-readable form of URL can be found in any text.

So it would mean adding more logic: check for these end characters only if the URL does not have a query part.
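
A minimal sketch of that logic, written as post-processing on the extracted strings rather than as a change inside URLExtract. The helper name `strip_trailing_punctuation` and the `TRAILING_CHARS` constant are hypothetical, made up for illustration; only `urllib.parse.urlsplit` from the standard library is used.

```python
from urllib.parse import urlsplit

# Assumed set of trailing characters to drop, taken from the report above.
TRAILING_CHARS = ".,?!-"


def strip_trailing_punctuation(url, chars=TRAILING_CHARS):
    """Drop trailing punctuation unless the URL has a query part (hypothetical helper)."""
    if urlsplit(url).query:
        # With a query part, the trailing character may be real data, so keep it.
        return url
    return url.rstrip(chars)


urls = [
    "http://httpbin.org/status/204,",
    "http://httpbin.org/status/204.",
    "http://example.com/status?bracket=[",
]
cleaned = [strip_trailing_punctuation(u) for u in urls]
print(cleaned)
```

The trade-off is that a URL with a query part that is followed by sentence punctuation (e.g. `...?a=1.`) would still keep the extra character.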

lipoja avatar Apr 09 '18 06:04 lipoja

Oh [] were just meant to enclose the characters. The invalid symbols are ".,?!-" (Quotes not included). Sorry for the misunderstanding. This is as per my research done. Could be wrong :sweat_smile:

ankitxjoshi avatar Apr 09 '18 06:04 ankitxjoshi

OK, thanks. I will look into all the issues you reported.

lipoja avatar Apr 09 '18 06:04 lipoja

Hi @MacBox7, sorry for such a long delay :(
Could you please help me with this issue? Especially with the part where it is defined that those symbols cannot be used as termination characters. I've read RFC 3986 and I don't think it is specified there. Maybe I missed something?

From the example above, I am still not able to say whether the ',' or '.' characters should or should not be part of the URL.

Thank you!

lipoja avatar Aug 29 '18 18:08 lipoja

In the meantime, one can hack around this (at least for commas) via

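    from urlextract import URLExtract
    u = URLExtract()
    # Add ',' to the stop characters on both sides (underscore-prefixed, i.e. private, attributes).
    u._stop_chars_right |= {','}
    u._stop_chars_left  |= {','}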

Perhaps a sensible default would be to treat unconventional special characters as forbidden in a URL, and to add a nicer constructor argument to configure that if someone really wants them in URLs?
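
A rough sketch of what such a configurable default could look like from the caller's side. Everything here is hypothetical: `TrimmingExtractor` and its `forbidden_edge_chars` argument are not part of urlextract, and `WhitespaceExtractor` is only a stand-in so the example runs without the library. Any object with a `find_urls(text)` method could be wrapped the same way.

```python
class WhitespaceExtractor:
    """Stand-in for a real extractor; splits on whitespace (illustration only)."""

    def find_urls(self, text):
        return [tok for tok in text.split() if tok.startswith("http")]


class TrimmingExtractor:
    """Hypothetical wrapper: strips configurable characters from both ends of each URL."""

    def __init__(self, extractor, forbidden_edge_chars=".,?!-"):
        self._extractor = extractor
        self._forbidden = forbidden_edge_chars

    def find_urls(self, text):
        # str.strip with an empty string removes nothing, so "" disables trimming.
        return [url.strip(self._forbidden) for url in self._extractor.find_urls(text)]


text = "http://httpbin.org/status/204, http://httpbin.org/status/204."
strict = TrimmingExtractor(WhitespaceExtractor()).find_urls(text)
permissive = TrimmingExtractor(WhitespaceExtractor(), forbidden_edge_chars="").find_urls(text)
print(strict)
print(permissive)
```

Passing an empty string keeps the current behaviour for anyone who really wants those characters in the URL.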

karlicoss avatar Feb 26 '19 22:02 karlicoss