URLExtract
URL Detection Problem
I tried to use the module to detect a link in the following string.
Link:https://www.google.com
but it failed to detect that there is a URL.
Hi @Ricardolcm888, thanks for reporting it.
However, this is not an easy issue to fix. The text is not typographically correct, there should be a space after the first colon sign: Link: https://example.com
And yeah I know - internet is full of these mistakes and typos.
Is it possible for you to somehow pre-process the text?
I will think about it; however, right now I do not see a general solution for this.
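One way the pre-processing suggested above could look is sketched here. It is only an illustration and not part of URLExtract: the helper name `separate_glued_urls` is hypothetical, and it assumes the only case to handle is a colon glued directly to an `http://` or `https://` scheme. A colon-prefixed URL nested inside another URL's query string would also be split, so it is not a general solution.

```python
import re

# Matches a colon that is immediately followed by an http(s) scheme.
# The colon inside "https://" itself is followed by "//", so it never matches.
_GLUED_SCHEME = re.compile(r":(?=https?://)", re.IGNORECASE)


def separate_glued_urls(text: str) -> str:
    """Insert a space between a colon and a URL scheme glued to it."""
    return _GLUED_SCHEME.sub(": ", text)


cleaned = separate_glued_urls("Link:https://www.google.com")
print(cleaned)
```

The cleaned text can then be passed to `find_urls` as usual.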
Yes, this is in fact a problem:
from urlextract import URLExtract
extractor = URLExtract()
extractor.find_urls('earn $600 every week, work from home job:https://2.ua/YHfw38')
results in:
["job:https://2.ua/YHfw38"]
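Until the library handles this case, one possible workaround is to clean up the extracted results afterwards. The sketch below is an assumption-laden illustration, not URLExtract's behavior or the approach of any PR: the helper name `strip_glued_prefix` is hypothetical, and it only removes a leading `word:` prefix (no dots, slashes, or whitespace) that sits directly in front of an `http://` or `https://` scheme.

```python
import re

# A leading run of characters without whitespace, "/", "." or ":",
# followed by a colon, directly in front of an http(s) scheme.
_GLUED_PREFIX = re.compile(r"^[^\s/.:]+:(?=https?://)", re.IGNORECASE)


def strip_glued_prefix(url: str) -> str:
    """Drop a 'word:' prefix glued to the front of an extracted URL."""
    return _GLUED_PREFIX.sub("", url)


found = ["job:https://2.ua/YHfw38"]  # the result reported above
print([strip_glued_prefix(u) for u in found])
```

Results that already start with a scheme, or that have no scheme at all, are returned unchanged.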
@lipoja RE:
The text is not typographically correct, there should be a space after the first colon sign
well, what should be and what is, sadly rarely coincide 🤣
I had to fix this for my ML pre-processing, pretty straightforward fix, will submit a PR shortly...
Here is the PR https://github.com/lipoja/URLExtract/pull/120
@lipoja I would appreciate it if you could merge that in and release it, so I do not have to ship a production model that depends on my code change as a hack in a git branch :)
@amoldavsky Thank you for contributing! Sure, I can merge it and release it. But before I do that, I would like to discuss a few of my ideas with you, so that we do not break extraction or unintentionally filter out some URLs that the current code extracts. Please have a look at your PR.
Yup, I started a discussion in the PR
Facing the same issue. I was curious what the state of the PR for fixing this is! :)