URL Detection Problem

Open Ricardolcm888 opened this issue 4 years ago • 5 comments

I tried to use the module to detect a link in the following string.

Link:https://www.google.com

but it failed to detect that there is a url.

Ricardolcm888 avatar Jan 11 '21 00:01 Ricardolcm888

Hi @Ricardolcm888, thanks for reporting it. However, this is not an easily fixed issue. The text is not typographically correct: there should be a space after the first colon, as in Link: https://example.com. And yeah, I know, the internet is full of these mistakes and typos. Is it possible for you to somehow pre-process the text?

I will think about it, however right now I do not see any general solution for this.
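A minimal sketch of the kind of pre-processing suggested above, assuming the only case to handle is a colon glued directly to an http:// or https:// scheme. The helper name separate_glued_urls is hypothetical and is not part of URLExtract; it uses only the standard library and inserts the missing space before the text is handed to the extractor.

```python
import re

# Zero-width match: the position right after a ':' and right before
# 'http://' or 'https://'. The colon inside 'https://' itself never
# matches, because it is followed by '//' rather than by a scheme.
_GLUED_SCHEME = re.compile(r"(?<=:)(?=https?://)", re.IGNORECASE)


def separate_glued_urls(text: str) -> str:
    """Insert a space between a colon and a URL scheme glued to it.

    Hypothetical helper for illustration; not part of URLExtract.
    """
    return _GLUED_SCHEME.sub(" ", text)


cleaned = separate_glued_urls("Link:https://www.google.com")
```

The cleaned string can then be passed to extractor.find_urls() as usual. Text that already has the space is left unchanged, and other glued prefixes (for example a comma or bracket before the scheme) are deliberately not covered by this sketch.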

lipoja avatar Jan 11 '21 11:01 lipoja

Yes, this is in fact a problem:

from urlextract import URLExtract

extractor = URLExtract()
extractor.find_urls('earn $600 every week, work from home job:https://2.ua/YHfw38')

results in:

["job:https://2.ua/YHfw38"]
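As a stopgap, the glued prefix can also be trimmed from the returned strings after extraction. This is only a sketch under the assumption that every affected result contains an explicit http:// or https:// scheme; strip_glued_prefix is a hypothetical name and not a URLExtract function.

```python
import re

_SCHEME = re.compile(r"https?://", re.IGNORECASE)


def strip_glued_prefix(candidate: str) -> str:
    """Drop any text before the first http(s):// in an extracted string.

    Hypothetical helper for illustration; results without an explicit
    scheme (e.g. 'example.com') are returned unchanged.
    """
    match = _SCHEME.search(candidate)
    return candidate[match.start():] if match else candidate


fixed = [strip_glued_prefix(u) for u in ["job:https://2.ua/YHfw38"]]
```

Applied to the result above, this keeps only the part starting at the scheme.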

@lipoja RE: "The text is not typographically correct, there should be space after the first colon sign": well, what should be and what is sadly rarely coincide 🤣

I had to fix this for my ML pre-processing. It is a pretty straightforward fix; I will submit a PR shortly...

amoldavsky avatar Mar 14 '22 02:03 amoldavsky

Here is the PR https://github.com/lipoja/URLExtract/pull/120

@lipoja I would appreciate it if you could merge that in and release it, so I do not have to ship a production model built on my code change as a hack in a git branch :)

amoldavsky avatar Mar 14 '22 03:03 amoldavsky

@amoldavsky Thank you for contributing! Sure, I can merge it and release it. But before I do that, I would like to discuss a few of my ideas with you, so we do not break extraction or unintentionally filter out some URLs that the current code extracts. Please have a look at your PR.

lipoja avatar Mar 15 '22 09:03 lipoja

Yup, I started a discussion in the PR.

amoldavsky avatar Mar 16 '22 15:03 amoldavsky

Facing the same issue; I was curious what the state of the PR fixing this is! :)

Stvad avatar Nov 30 '22 16:11 Stvad