tldextract
Making tldextract serializable and faster with Trie dictionary
First of all, let me say I'm a huge fan of @john-kurkowski's tldextract. I find it critical for working with the Common Crawl dataset and other projects.
I have found, quite by accident, that the package is not serializable, but I believe it could be modified quite easily to be so. Doing so could, I think, also speed up the lookup function by ~20% or so. Serializability could be important for big data projects that use Spark broadcast or other distributed processing beyond a single core.
Here is what I'm seeing:
import json
import tldextract
ext = tldextract.TLDExtract()
ext._extractor.tlds_incl_private_trie
Out[14]: <tldextract.tldextract.Trie at 0x2305deb1840>
json.dumps(ext._extractor.tlds_incl_private_trie)
Traceback (most recent call last):
File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-15-d4e5d6e8c9ec>", line 1, in <module>
json.dumps(ext._extractor.tlds_incl_private_trie)
File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Trie is not JSON serializable
Also, with pickle:
import pickle
pickle.dumps(ext)
Traceback (most recent call last):
File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-17-188183203f90>", line 1, in <module>
pickle.dumps(extract)
_pickle.PicklingError: Can't pickle <function ext at 0x000002305F541480>: attribute lookup extract on __main__ failed
This seems to be because the underlying Trie is a custom class.
This could be resolved in several ways:
- Add a method to the Trie class to tell it how to serialize/deserialize itself (a bit hacky in my opinion)
- Tell json or pickle how to serialize/deserialize it (again, a band-aid)
- Rewrite the Trie class to be a standard dict (I think this is the best way, as the dict would likely be faster, ~20%). ref(
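For comparison, here is a rough sketch of what the first two options might look like. `Node`, `to_dict`, `from_dict` and `encode_node` are hypothetical names made up for illustration; `Node` is only a stand-in for a custom trie class, not tldextract's actual implementation.

```python
import json


class Node:
    """Hypothetical stand-in for a custom trie node (not tldextract's class)."""

    def __init__(self, matches=None, end=False, is_private=False):
        self.matches = matches if matches is not None else {}
        self.end = end
        self.is_private = is_private

    def to_dict(self):
        """Option 1: the class knows how to turn itself into plain data."""
        return {
            "end": self.end,
            "is_private": self.is_private,
            "matches": {k: v.to_dict() for k, v in self.matches.items()},
        }

    @classmethod
    def from_dict(cls, data):
        """Rebuild a node and its children from to_dict() output."""
        children = {k: cls.from_dict(v) for k, v in data["matches"].items()}
        return cls(children, data["end"], data["is_private"])


def encode_node(obj):
    """Option 2: a `default` hook that teaches json.dumps about Node."""
    if isinstance(obj, Node):
        return obj.to_dict()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Tiny example tree: "uk" is a suffix, and so is "co.uk".
root = Node({"uk": Node({"co": Node(end=True)}, end=True)})
payload = json.dumps(root, default=encode_node)
restored = Node.from_dict(json.loads(payload))
```

Both options work, but every caller has to remember the hook or the conversion step. The third option avoids that extra plumbing, since json already knows how to encode a dict (and dict subclasses).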
Below is an untested way to do the dict-based rewrite that would likely require no additional changes to the private calling class.
If this is of sufficient interest, I'd be glad to provide a PR.
Updated 10/11/24
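from typing import Collection


class Trie(dict):
    """Alternative trie for tldextract, built on Python's dict class."""

    # Lets calling code keep using dot-attribute access (node.is_private).
    # Caveat: missing names, including dunder lookups, resolve to None
    # instead of raising AttributeError.
    __getattr__ = dict.get

    @staticmethod
    def create(
        match_list: Collection[str],
        reverse: bool = False,
        is_private: bool = False,
    ) -> "Trie":
        """Create a Trie from a list of matches and return its root node."""
        root_node = Trie()
        for m in match_list:
            # Pass by keyword so is_private is not mistaken for reverse.
            root_node._add_match(m, reverse=reverse, is_private=is_private)
        return root_node

    def _add_match(self, match: str, reverse: bool = False, is_private: bool = False) -> None:
        """Append a suffix's labels to this Trie node."""
        labels = match.split(".")
        node = self
        if reverse:
            labels = list(reversed(labels))
        for label in labels:
            # Children must be Trie instances so dot access works on them too.
            node = node.setdefault(label, Trie())
        node["is_private"] = is_private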
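
To sanity-check the idea, here is a minimal, self-contained sketch that uses plain nested dicts (no subclass) and round-trips the trie through both json and pickle. It is only an illustration under stated assumptions: `build_trie`, `suffix_length` and the marker keys `.end` / `.private` are hypothetical names, not part of tldextract. The markers start with a dot so they can never collide with a label produced by splitting on ".".

```python
import json
import pickle

# Hypothetical marker keys for this sketch; a real label never contains ".".
END = ".end"
PRIVATE = ".private"


def build_trie(suffixes, is_private=False, trie=None):
    """Insert each suffix into a nested-dict trie, last label first."""
    trie = {} if trie is None else trie
    for suffix in suffixes:
        node = trie
        for label in reversed(suffix.split(".")):
            node = node.setdefault(label, {})
        node[END] = True
        node[PRIVATE] = is_private
    return trie


def suffix_length(trie, host):
    """Return how many trailing labels of host form the longest known suffix."""
    node, best = trie, 0
    for depth, label in enumerate(reversed(host.split(".")), start=1):
        node = node.get(label)
        if node is None:
            break
        if node.get(END):
            best = depth
    return best


trie = build_trie(["com", "uk", "co.uk"])
build_trie(["blogspot.com"], is_private=True, trie=trie)

# Plain dicts of str/bool survive both serializers unchanged.
via_json = json.loads(json.dumps(trie))
via_pickle = pickle.loads(pickle.dumps(trie))
```

A copy restored from either serializer answers lookups the same way as the original, e.g. `suffix_length(via_json, "www.example.co.uk")` walks `uk` then `co` and stops at `example`. Whether this is actually ~20% faster than the current class would still need to be measured.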