
How can I do 1.2 billion extractions faster?

Open marklit opened this issue 4 years ago • 15 comments

I recently used this library to extract the second-level domain names from the 1.2 billion PTR records in Rapid7's Sonar database. As an example, it would extract totinternet from node-nnf.pool-1-1.dynamic.totinternet.net. I used a multiprocessing Pool of 16 workers on an Intel Core i5-4670K @ 3.40 GHz and it finished in just over a day. I suspect the pool count might have been too high for four cores.
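
For reference, each lookup is a single call along these lines (the domain field is the part I keep):

from tldextract import extract

result = extract('node-nnf.pool-1-1.dynamic.totinternet.net')
# result.subdomain == 'node-nnf.pool-1-1.dynamic'
# result.domain    == 'totinternet'
# result.suffix    == 'net'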

I'm trying to see if there are any optimisations that could be made to speed this 1.2B-record job up. I compared running the extractor on a single hostname one million times versus a million trivial string-splitting and indexing operations. I know the two operations below do very different things, but I'm trying to understand what the performance floor looks like for this sort of work in Python.

from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

_ = [h.split('.')[:-3] for x in range(0, 1000000)] # 1.26 seconds
_ = [tld_ex(h) for x in range(0, 1000000)] # 23.4 seconds

For the edge cases tldextract can support, it looks to be operating very quickly already. But even if that 23.4 seconds per one million lookups were maintained, it would still take ~8 hours to do 1.2 billion lookups (23.4 s per million × 1,200 million records ≈ 28,000 s ≈ 7.8 hours).

I'm wondering if there is something that could speed up the library even further. There appear to be around 7K TLDs that are looked up against; is there a way the search space could be reduced here? Could more common TLDs be compared against sooner?

Would porting the library to Cython be feasible?

Is there some obvious optimisation I'm missing in this library's basic usage?

marklit avatar Sep 28 '19 15:09 marklit

Is there some obvious optimisation I'm missing in this library's basic usage?

I don't think so. This scale is uncharted territory. 😃

john-kurkowski avatar Sep 30 '19 16:09 john-kurkowski

I used a multiprocessing Pool of 16 workers

Can you share the rough code here? I would expect this in-memory, CPU-oriented splitting and traversal to parallelize well.

john-kurkowski avatar Sep 30 '19 16:09 john-kurkowski

This is the code I ran.

from itertools import chain, islice
from multiprocessing import Pool
import socket
import struct
import sys
import uuid

from tldextract import extract as tld_ex


# Pack a dotted-quad IP address into an unsigned 32-bit integer.
ip2int = lambda x: struct.unpack("!I", socket.inet_aton(x))[0]


def group(num_items, iterable):
    # Yield the iterable in chunks of num_items (not used in the run below).
    iterable = iter(iterable)

    while True:
        chunk = islice(iterable, num_items)

        try:
            first_element = next(chunk)
        except StopIteration:
            return

        yield chain((first_element,), chunk)


def extract(manifest):
    # Append "<ip as int>,<registered domain>" to the output file named in
    # the manifest.
    file_, line = manifest
    try:
        ip, domain = line.split(',')
        open(file_, 'a')\
            .write('%d,%s\n' % (ip2int(ip),
                                tld_ex(domain)[1]))
    except:
        pass  # skip malformed lines


# 16 output files, one per slot in each batch of stdin lines.
file_names = ['out.%s.csv' % str(uuid.uuid4())[:6]
              for _ in range(0, 16)]

pool = Pool(16)

# Read stdin 16 lines at a time and farm each batch out to the pool.
while True:
    payload = [(file_names[x], sys.stdin.readline().strip())
               for x in range(0, 16)]
    pool.map(extract, payload)

marklit avatar Sep 30 '19 16:09 marklit
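
(For comparison, a variant that streams stdin straight into the pool with a large chunksize, rather than 16-line batches, might look like the sketch below; the ip2int conversion and per-worker output files are omitted, and the names are illustrative.)

from multiprocessing import Pool
import sys

from tldextract import extract as tld_ex


def second_level(line):
    # Return "<ip>,<registered domain>", or None for malformed lines.
    try:
        ip, domain = line.rstrip('\n').split(',')
        return '%s,%s' % (ip, tld_ex(domain)[1])
    except ValueError:
        return None


if __name__ == '__main__':
    with Pool(4) as pool:  # one worker per physical core
        # A large chunksize amortises the inter-process overhead per record.
        for out in pool.imap_unordered(second_level, sys.stdin, chunksize=10000):
            if out is not None:
                sys.stdout.write(out + '\n')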

23.4 seconds per one million lookups … is there a way the search space could be reduced here? Could more common TLDs be compared against sooner?

I like these questions. Reviewing the core algorithm, I thought it was O(k), where k is the number of parts of the input string. I thought we'd be bounded by set lookups. To your point, maybe there are set lookups it shouldn't try? It could exit sooner?
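
To illustrate the shape of that lookup (a sketch in the spirit of the description above, not tldextract's exact code):

def suffix_index_by_set(labels, suffixes):
    # Try progressively shorter candidate suffixes, left to right, and stop
    # at the first (i.e. longest) one found in the suffix set.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return i
    return len(labels)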

Where is it spending its time in those 23.4s?

john-kurkowski avatar Sep 30 '19 16:09 john-kurkowski

That's a good question. I haven't run it through a flame graph or anything; I guess that would be a good place to start.

If it were the regex causing most of the load, do you think it would be possible to have two TLD lists, one with all 7K TLDs and one with the top 50 or so? Would something like that cause a lot of edge-case issues?

marklit avatar Sep 30 '19 16:09 marklit

See what the problem is first. Again, my thinking is it's input bound, not PSL length bound. We'll see.

I've had decent luck with Python's builtin profiler. It dumps a text spreadsheet. I'm not up to date on Python state of the art. A flame graph would be nice to share.
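
For example, a minimal run of the builtin profiler over the micro-benchmark from earlier (a sketch; the loop count is scaled down and the report is sorted by cumulative time) could look like:

import cProfile

from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

# Profile a smaller loop and sort the report by cumulative time.
cProfile.runctx("[tld_ex(h) for _ in range(100000)]",
                globals(), locals(), sort="cumulative")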

john-kurkowski avatar Oct 01 '19 17:10 john-kurkowski

I've managed to produce a flame graph; GitHub demands the SVG be ZIP-compressed in order to upload it. I'll leave the setup notes I used to produce this graph here in case they're of any use. This was all run on Ubuntu 16.04.2 LTS.

$ sudo apt update
$ sudo apt install \
    autoconf \
    automake \
    autotools-dev \
    g++ \
    git \
    libtool \
    make \
    pkg-config \
    python-dev \
    virtualenv

$ virtualenv ~/.test
$ source ~/.test/bin/activate
$ pip install tldextract
$ vi ~/test.py
from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

_ = [tld_ex(h) for x in range(0, 1000000)]
$ git clone https://github.com/brendangregg/FlameGraph ~/flamegraph
$ git clone https://github.com/uber/pyflame ~/pyflame
$ cd ~/pyflame
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ cd ~
$ pyflame -s 10 -r 0.01 -o ~/perf.data -t python ~/test.py
$ ~/flamegraph/flamegraph.pl ~/perf.data > ~/perf.svg

marklit avatar Oct 02 '19 06:10 marklit

Hi!

I'm somewhat familiar with this process. Looking into things on my laptop here (2.7 GHz Intel Core i7), I get the following performance to start:

In [2]: %time _ = [h.split('.')[:-3] for x in range(0, 1000000)]
CPU times: user 801 ms, sys: 76 ms, total: 877 ms
Wall time: 878 ms

In [3]: %time _=[tld_ex(h) for x in range(0, 1000000)]
CPU times: user 7.66 s, sys: 158 ms, total: 7.82 s
Wall time: 7.85 s

Writing the following function to remove the extra processing related to URLs and punycode:

from tldextract import TLDExtract
def ex(netloc, _ex=TLDExtract()._get_tld_extractor()):
    # Skip tldextract's URL/punycode handling and go straight to the suffix lookup.
    labels = netloc.split(".")
    suffix_index = _ex.suffix_index(labels)
    suffix = ".".join(labels[suffix_index:])
    subdomain = ".".join(labels[:suffix_index - 1]) if suffix_index else ""
    domain = labels[suffix_index - 1] if suffix_index else ""
    return subdomain, domain, suffix

gives

In [1]: %time _ = [ex(h) for x in range(0, 1000000)]
CPU times: user 3.23 s, sys: 99.8 ms, total: 3.33 s
Wall time: 3.34 s

So, half the time is gone, but it's still almost 4x slower.

When I have implemented this in the past, I have used a trie-type structure mapping the TLDs from right to left. So visually, something like:

com
uk
  co
  ac
  ..
jp
  ne
  ..
..

So, if you are parsing the domain from right to left: if you see 'com' you hit the end of the tree and know you have a TLD, but if you see 'uk', you need to check whether the next part is 'co', 'ac', etc.

This looks something like this:

import collections
def tree():
    return collections.defaultdict(tree)

class _PublicSuffixListTLDExtractor(object):
    """Wrapper around this project's main algo for PSL
    lookups.
    """

    def __init__(self, tlds):
        trie = tree()
        for tld in tlds:
            loc = trie
            for part in reversed(tld.split(".")):
                part = part.lstrip('*!.')  # strip wildcard/exception markers (handled naively here)
                loc = loc[part]
        self.trie = trie

    def suffix_index(self, lower_spl):
        """Returns the index of the first suffix label.
        Returns len(lower_spl) if no suffix is found.
        """
        length = len(lower_spl)
        loc = self.trie
        for i in range(length-1, -1, -1):
            key = lower_spl[i]
            if key not in loc:
                return i+1
            loc = loc[key]

        return length

I basically ignored the *. and ! rules for now; I don't think they're hard to handle, but I'm not familiar with them.
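
A quick sanity check of the trie version with a toy suffix list (just a few entries for illustration, not the full PSL):

ex_trie = _PublicSuffixListTLDExtractor(["com", "co.uk", "ne.jp"])
labels = "kd111239172113.au-net.ne.jp".split(".")
idx = ex_trie.suffix_index(labels)      # 2
print(".".join(labels[idx:]))           # 'ne.jp'
print(labels[idx - 1])                  # 'au-net' (the registered domain)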

re-running the test gives:

In [6]: %time _ = [ex(h) for x in range(0, 1000000)]
CPU times: user 1.76 s, sys: 90.6 ms, total: 1.86 s
Wall time: 1.86 s

This gives the biggest win for long domain names. They are still both O(k), but going from right to left means k ends up being 1 or 2 in the most common cases.

before

In [2]: %time _ = [ex("a.really.long.domain.name.with.a.lot.of.parts.com") for x in range(0, 1000000)]
CPU times: user 9.92 s, sys: 152 ms, total: 10.1 s
Wall time: 10.1 s

after:

In [1]: %time _ = [ex("a.really.long.domain.name.with.a.lot.of.parts.com") for x in range(0, 1000000)]
CPU times: user 1.88 s, sys: 108 ms, total: 1.99 s
Wall time: 2 s

JustinAzoff avatar Jan 02 '20 17:01 JustinAzoff

a.really.long.domain.name.with.a.lot.of.parts.com

That made it click for me. Yes, a trie going right to left would help for these pathological inputs (yet common inputs for certain use cases). I like that the optimization is not much more code.

john-kurkowski avatar Jan 04 '20 00:01 john-kurkowski

I also like the clear stats. Thank you for that effort. I didn't understand the flame graph. Nor my own cProfile spelunking. (The best I could blame was the years-old, obtuse chain of string manipulation, near the beginning of processing 1 input. It's a bit wasteful. I dreaded making it more obtuse in the name of maybe speeding up the OP's case. I guess that's still on the table, but we'll cross that bridge if we come to it.)

john-kurkowski avatar Jan 04 '20 01:01 john-kurkowski

And if one were hesitant about the change to a trie, another optimization against this pathological case would be to ignore any labels beyond the maximum rule depth in the PSL. At the time of writing, that appears to be 5 labels, experimentally determined:

❯ rg '^[^/].*\..*\..*\..*\.' public_suffix_list.dat
10688:*.compute.amazonaws.com.cn
10693:cn-north-1.eb.amazonaws.com.cn
10694:cn-northwest-1.eb.amazonaws.com.cn
10717:*.elb.amazonaws.com.cn
10741:s3.cn-north-1.amazonaws.com.cn
10747:s3.dualstack.ap-northeast-1.amazonaws.com
10748:s3.dualstack.ap-northeast-2.amazonaws.com
10749:s3.dualstack.ap-south-1.amazonaws.com
10750:s3.dualstack.ap-southeast-1.amazonaws.com
10751:s3.dualstack.ap-southeast-2.amazonaws.com
10752:s3.dualstack.ca-central-1.amazonaws.com
10753:s3.dualstack.eu-central-1.amazonaws.com
10754:s3.dualstack.eu-west-1.amazonaws.com
10755:s3.dualstack.eu-west-2.amazonaws.com
10756:s3.dualstack.eu-west-3.amazonaws.com
10757:s3.dualstack.sa-east-1.amazonaws.com
10758:s3.dualstack.us-east-1.amazonaws.com
10759:s3.dualstack.us-east-2.amazonaws.com
11780:app.os.stg.fedoraproject.org
12185:users.scale.virtualcloud.com.br
12193:it1.eur.aruba.jenv-aruba.cloud
12240:cloud.jelastic.open.tim.it

~
❯ rg '^[^/].*\..*\..*\..*\..*\.' public_suffix_list.dat

~
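
A minimal sketch of that cap, reusing the suffix_index helper discussed above (capped_suffix_index and MAX_PSL_DEPTH are illustrative names, not part of tldextract):

MAX_PSL_DEPTH = 5  # deepest rule currently in the PSL, per the search above

def capped_suffix_index(extractor, labels):
    # Only the trailing MAX_PSL_DEPTH labels can take part in a suffix match,
    # so skip everything to their left before doing the lookup.
    tail = labels[-MAX_PSL_DEPTH:]
    return len(labels) - len(tail) + extractor.suffix_index(tail)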

john-kurkowski avatar Dec 18 '20 20:12 john-kurkowski

Note that even for smaller domains, tldextract seems to be significantly slower than the others I've tested, publicsuffix2 and publicsuffixlist:

from timeit import timeit

def test(length):
    domain = 'a.' * length + 'net.uy'
    tlde = timeit(f"psl('{domain}')", setup="from tldextract import TLDExtract; psl=TLDExtract(suffix_list_urls=None)")
    ps2 = timeit(f"psl.get_public_suffix('{domain}')", setup="from publicsuffix2 import PublicSuffixList; psl=PublicSuffixList()")
    psl = timeit(f"psl.privatesuffix('{domain}')", setup="from publicsuffixlist import PublicSuffixList; psl=PublicSuffixList()")
    print(f"{length}: tlde {tlde:.2} / ps2 {ps2:.2} / psl {psl:.2} ({domain})")

for i in range(7):
    test(i)

results in:

0: tlde 3.0 / ps2 2.8 / psl 1.0 (net.uy)
1: tlde 4.1 / ps2 3.0 / psl 1.8 (a.net.uy)
2: tlde 4.9 / ps2 3.0 / psl 2.4 (a.a.net.uy)
3: tlde 5.7 / ps2 3.1 / psl 3.0 (a.a.a.net.uy)
4: tlde 6.7 / ps2 4.4 / psl 4.0 (a.a.a.a.net.uy)
5: tlde 8.8 / ps2 3.4 / psl 3.4 (a.a.a.a.a.net.uy)
6: tlde 8.7 / ps2 3.6 / psl 3.6 (a.a.a.a.a.a.net.uy)

(That's in seconds for 1 million invocations. Also, thanks to @lufte for the original test code over in https://github.com/qutebrowser/qutebrowser/issues/756#issuecomment-634006005)

The-Compiler avatar Jan 13 '21 10:01 The-Compiler

I've found a library that uses a trie data structure like what @JustinAzoff did above. It might be useful for any future attempts to improve tldextract performance.

elliotwutingfeng avatar Apr 03 '22 04:04 elliotwutingfeng

Some of the discussed optimizations were contributed in #283 (and more in #284) and released in 3.4.2. I don't know if it addresses this whole ticket yet, but give it a try! Thank you @elliotwutingfeng!

john-kurkowski avatar May 16 '23 19:05 john-kurkowski

Thanks @john-kurkowski for the merges, I've re-run the benchmarks by @The-Compiler.

For typical use cases, the performance gap between tldextract and the other libraries is much narrower now.

Specs: Linux x64, Ryzen 7 5800X

Python 3.10

0: tlde 1.7 / ps2 1.6 / psl 0.6 (net.uy)
1: tlde 1.9 / ps2 1.8 / psl 1.0 (a.net.uy)
2: tlde 2.0 / ps2 1.8 / psl 1.3 (a.a.net.uy)
3: tlde 2.0 / ps2 1.8 / psl 1.7 (a.a.a.net.uy)
4: tlde 2.1 / ps2 1.8 / psl 1.9 (a.a.a.a.net.uy)
5: tlde 2.1 / ps2 1.9 / psl 1.8 (a.a.a.a.a.net.uy)
6: tlde 2.1 / ps2 1.9 / psl 1.9 (a.a.a.a.a.a.net.uy)

Interestingly, tldextract now outperforms the other libraries on PyPy for long FQDNs, unless the results are being cached by PyPy; I'm not sure about that.

PyPy 7.3.11 with GCC 12.2.0 (Python 3.9)

0: tlde 0.22 / ps2 0.28 / psl 0.11 (net.uy)
1: tlde 0.24 / ps2 0.31 / psl 0.29 (a.net.uy)
2: tlde 0.29 / ps2 0.33 / psl 0.44 (a.a.net.uy)
3: tlde 0.28 / ps2 0.36 / psl 0.6 (a.a.a.net.uy)
4: tlde 0.33 / ps2 0.38 / psl 0.68 (a.a.a.a.net.uy)
5: tlde 0.31 / ps2 0.39 / psl 0.68 (a.a.a.a.a.net.uy)
6: tlde 0.36 / ps2 0.4 / psl 0.69 (a.a.a.a.a.a.net.uy)

elliotwutingfeng avatar May 17 '23 19:05 elliotwutingfeng

I haven't heard further on this issue, so I'll call it a win.

john-kurkowski avatar Mar 01 '24 23:03 john-kurkowski