tldextract
Timeout: The file lock 'some/path/to/8738.tldextract.json.lock' could not be acquired
I start getting this error when I increase the number of processes / threads to a certain point.
Is there a way to increase the timeout value?
More importantly, why is a lock needed here if tldextract isn't writing anything, only reading?
It's fetching and saving the latest version of the top level domains list.
The lock prevents multiple threads and processes from needlessly requesting the data and then contending as they write it to the same location.
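To illustrate why a lock helps here, the following is a minimal in-process sketch, not tldextract's actual code. It uses `threading.Lock` as a stand-in for the file lock, and the names `fetch_suffix_list`, `get_suffix_list`, and `fetch_count` are hypothetical. The lock guards the check-then-fetch step so that only the first caller performs the download.

```python
import threading

_cache = {}
_cache_lock = threading.Lock()
fetch_count = 0  # how many times the (simulated) download ran


def fetch_suffix_list():
    """Hypothetical stand-in for the slow network download."""
    global fetch_count
    fetch_count += 1
    return ["com", "org", "co.uk"]


def get_suffix_list():
    """Return the cached list, fetching it at most once across all threads."""
    with _cache_lock:
        # Only the first thread to get the lock finds the cache empty.
        if "suffixes" not in _cache:
            _cache["suffixes"] = fetch_suffix_list()
        return _cache["suffixes"]


threads = [threading.Thread(target=get_suffix_list) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, several threads could find the cache empty at the same moment and each start its own download.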
The lock timeout is currently set to 20 seconds. https://github.com/john-kurkowski/tldextract/blob/40205f67df5f59df4b88ce47bbbe98f1eff36230/tldextract/cache.py#L78
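As a rough analogue of how such a timeout error can arise, the sketch below uses `threading.Lock.acquire(timeout=...)` from the standard library rather than the file lock tldextract uses. The helper `try_to_use_cache` and the 0.05 second timeout are assumptions chosen for illustration. A waiter that cannot get the lock within its timeout gives up, which is roughly what happens when many workers queue behind one slow holder.

```python
import threading

lock = threading.Lock()
results = []  # one entry per attempt: True if the lock was acquired


def try_to_use_cache(timeout_seconds):
    """Record whether the lock could be acquired within the timeout."""
    acquired = lock.acquire(timeout=timeout_seconds)
    results.append(acquired)
    if acquired:
        lock.release()


# The main thread holds the lock for the whole time the waiter is trying,
# so the waiter's attempt times out.
with lock:
    waiter = threading.Thread(target=try_to_use_cache, args=(0.05,))
    waiter.start()
    waiter.join()

# Once the lock is free, the same call succeeds immediately.
try_to_use_cache(0.05)
```

The more workers contend for the lock, the longer the last ones wait, which would be consistent with the error only appearing past a certain number of processes or threads.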
I'd suggest either disabling the list update or doing it beforehand and then disabling it. See the readme for details.
I did look at it, but it's not clear to me whether cache_dir=False disables writing to the cache (downloading new info) or reading from the cache (fetching directly from the internet) in these examples:
```python
import tldextract

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=False)
no_cache_extract('http://www.google.com')
```
I don't feel a need for any custom path, so in which order would you run tldextract.TLDExtract() and tldextract.TLDExtract(cache_dir=False)?
If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)
I also have trouble piecing together when this project fetches and when it caches, and I wrote much of the project. 😅 The docs could be improved.
> If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)
Close. You only need the first instance. After you call the instance once (with, say, google.com), the list is fetched and cached, and future calls to that same instance should have no contention.
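The warm-up-then-parallelize pattern described above could be sketched as follows. This assumes a thread pool and takes the extractor as a parameter; the helper name `warm_then_extract_all` and its arguments are hypothetical and not part of tldextract.

```python
from concurrent.futures import ThreadPoolExecutor


def warm_then_extract_all(extract, urls, warmup_url="http://www.google.com", max_workers=8):
    """Call `extract` once serially, then reuse the same callable in parallel."""
    # Single up-front call: any fetching and caching happens here, before
    # there are other workers to contend with.
    extract(warmup_url)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `urls` in the results.
        return list(pool.map(extract, urls))
```

With tldextract, `extract` would be a single `tldextract.TLDExtract()` instance shared by all workers, for example `warm_then_extract_all(tldextract.TLDExtract(), urls)`. Whether the same holds for separate processes depends on how they are started and is not covered by this sketch.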