URLExtract
Passing a custom cache_dir doesn't seem to actually save the tlds...txt file in that dir
(venv) yossi@ubuntu7:~/testing$ python --version
Python 3.10.2
(venv) yossi@ubuntu7:~/testing$ pip list
Package      Version
------------ -------
filelock     3.6.0
idna         3.3
pip          22.0.3
platformdirs 2.5.1
setuptools   58.1.0
uritools     4.0.0
urlextract   1.5.0
(venv) yossi@ubuntu7:~/testing$ more test.py
from urlextract import URLExtract
import logging
logging.basicConfig(format='%(asctime)s - %(levelname)s\n%(message)s', level=logging.INFO)
extractor = URLExtract(cache_dir='.')
extractor.update() # same results with or without this line
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']
(venv) yossi@ubuntu7:~/testing$ python test.py
2022-02-23 23:50:02,092 - INFO
Cache file not found in './tlds-alpha-by-domain.txt'. Use URLExtract.update() to download newest version.
2022-02-23 23:50:02,093 - INFO
Using default list of TLDs provided in urlextract package.
['janlipovsky.cz']
(venv) yossi@ubuntu7:~/testing$ ls -la
total 20
drwxrwxr-x 3 yossi yossi 4096 Feb 23 23:48 .
drwxr-xr-x 120 yossi yossi 4096 Feb 23 23:43 ..
-rw-rw-r-- 1 yossi yossi 357 Feb 23 23:48 test.py
-rwxrwxr-x 1 yossi yossi 0 Feb 23 23:50 tlds-alpha-by-domain.txt.lock
drwxrwxr-x 5 yossi yossi 4096 Feb 23 23:47 venv
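
To make the symptom above easier to check on another machine, here is a minimal sketch that inspects the directory passed as cache_dir after test.py has run. It uses only the standard library and does not call urlextract. The helper name describe_cache_dir is hypothetical (not part of urlextract), and the file names are taken from the log message and the ls output above.

```python
import os
from pathlib import Path

# File name taken from the INFO log message above; the ".lock" suffix
# matches the empty file that shows up in the ls output.
CACHE_FILE_NAME = "tlds-alpha-by-domain.txt"


def describe_cache_dir(cache_dir="."):
    """Hypothetical helper: report what was left behind in cache_dir."""
    directory = Path(cache_dir).resolve()
    cache_file = directory / CACHE_FILE_NAME
    lock_file = directory / (CACHE_FILE_NAME + ".lock")
    has_cache = cache_file.is_file()
    return {
        "directory": str(directory),
        # Rules out a plain permissions problem on the directory.
        "writable": os.access(directory, os.W_OK),
        "cache_file_exists": has_cache,
        "cache_file_size": cache_file.stat().st_size if has_cache else None,
        "lock_file_exists": lock_file.is_file(),
    }


if __name__ == "__main__":
    for key, value in describe_cache_dir(".").items():
        print(f"{key}: {value}")
```

In the session above this would be expected to report a writable directory, a lock file, and no cache file, which is the behavior this issue describes.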