Making tldextract serializable and faster with Trie dictionary

Open leeprevost opened this issue 5 months ago • 5 comments

First of all, let me say I'm a huge fan of @john-kurkowski's tldextract. I find it critical for work with the Common Crawl dataset and other projects.

I have found, quite by accident, that the package is not serializable, but I believe it could be modified quite easily to be so. I think the same change could also speed up the lookup function by ~20% or so. Serializability could be important for big data projects that use Spark broadcast or other distributed processing beyond a single core.

Here is what I'm seeing:

import json
import tldextract
ext = tldextract.TLDExtract()
ext._extractor.tlds_incl_private_trie
Out[14]: <tldextract.tldextract.Trie at 0x2305deb1840>
json.dumps(ext._extractor.tlds_incl_private_trie)
Traceback (most recent call last):
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-d4e5d6e8c9ec>", line 1, in <module>
    json.dumps(ext._extractor.tlds_incl_private_trie)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Trie is not JSON serializable

Also:

import pickle
pickle.dumps(ext)
Traceback (most recent call last):
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-188183203f90>", line 1, in <module>
    pickle.dumps(extract)
_pickle.PicklingError: Can't pickle <function ext at 0x000002305F541480>: attribute lookup extract on __main__ failed

This seems to be because the underlying Trie is a custom class.

This could be resolved in several ways:

  1. Add a method to the Trie class that tells it how to serialize/deserialize itself (a bit hacky in my opinion).
  2. Tell json or pickle how to serialize/deserialize it (again, a band-aid).
  3. Rewrite the Trie class as a standard dict (I think this is the best way, as the dict would likely be faster, ~20%). ref(
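To make options 1 and 2 concrete, here is a minimal sketch. `Node` is a hypothetical stand-in for a custom trie node class, not tldextract's actual `Trie`, and `to_dict`/`from_dict` are names I made up for illustration. Option 2 relies only on the standard `default=` hook of `json.dumps`.

```python
import json


class Node:
    """Hypothetical stand-in for a custom trie node (not tldextract's Trie)."""

    def __init__(self, matches=None, end=False, is_private=False):
        self.matches = matches if matches is not None else {}
        self.end = end
        self.is_private = is_private

    # Option 1: the class converts itself to and from plain data.
    def to_dict(self):
        return {
            "matches": {k: v.to_dict() for k, v in self.matches.items()},
            "end": self.end,
            "is_private": self.is_private,
        }

    @classmethod
    def from_dict(cls, data):
        children = {k: cls.from_dict(v) for k, v in data["matches"].items()}
        return cls(children, data["end"], data["is_private"])


# Tiny made-up trie holding the single suffix "co.uk".
root = Node({"uk": Node({"co": Node(end=True)})})

# Option 1: serialize through the class's own method, then rebuild.
via_method = json.dumps(root.to_dict())
restored = Node.from_dict(json.loads(via_method))

# Option 2: leave the class alone and give json a fallback encoder.
# vars() returns the instance __dict__; json then recurses into child nodes.
via_hook = json.dumps(root, default=vars)
```

Both routes work, but each one needs extra code that has to stay in sync with the class, which is why I lean toward option 3.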

Below is an untested way to do option 3 that would likely require no additional changes to the private calling class.

If this is of sufficient interest, I'd be glad to provide a PR.

Updated 10/11/24

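from collections.abc import Collection


class Trie(dict):
    """Alternative trie for tldextract built on the Python dict class."""

    # Lets calling code keep using dot-attribute access (e.g. node.is_private).
    # Caveat: this returns None for any missing attribute, including optional
    # dunder methods that pickle and copy probe for, so it may need a guard.
    __getattr__ = dict.get

    @staticmethod
    def create(
        match_list: Collection[str],
        reverse: bool = False,
        is_private: bool = False,
    ) -> "Trie":
        """Create a Trie from a list of matches and return its root node."""
        root_node = Trie()

        for m in match_list:
            # Pass by keyword so is_private is not mistaken for reverse.
            root_node._add_match(m, reverse=reverse, is_private=is_private)

        return root_node

    def _add_match(self, match: str, reverse: bool = False, is_private: bool = False) -> None:
        """Append a suffix's labels to this Trie node."""
        labels = match.split(".")
        if reverse:
            labels.reverse()  # in place; the bool parameter is not callable

        node = self
        for label in labels:
            # Children are Trie instances too, so dot access works at every level.
            node = node.setdefault(label, Trie())
        node["is_private"] = is_private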
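
As a quick check of the serializability claim, here is a minimal sketch that builds the same kind of nested-dict structure with plain dicts and round-trips it through both `json` and `pickle`. `build_trie` is a hypothetical helper and the suffix list is made-up sample input, not the real Public Suffix List. I have not run this against tldextract itself.

```python
import json
import pickle


def build_trie(suffixes, is_private=False):
    """Build a nested-dict trie keyed by suffix labels, right-most label first."""
    root = {}
    for suffix in suffixes:
        node = root
        for label in reversed(suffix.split(".")):
            node = node.setdefault(label, {})
        node["is_private"] = is_private  # marks the end of a complete suffix
    return root


# Made-up sample input, not the real Public Suffix List.
trie = build_trie(["com", "co.uk", "org.uk"])

# Both standard serializers round-trip the structure unchanged.
from_json = json.loads(json.dumps(trie))
from_pickle = pickle.loads(pickle.dumps(trie))
```

Because every node is an ordinary dict with string keys, no custom encoder or reducer is needed, which is the property a Spark broadcast would rely on.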

leeprevost, Sep 16 '24 16:09