PyDomainExtractor

You can't compare this with tldextract. tldextract extracts more data correctly, while this domain scanner can't

Open vihaanmody1 opened this issue 2 years ago • 4 comments

tldextract handles IP addresses and URLs that include an HTTP scheme, while this extractor can't. Speed doesn't matter in this case. What matters is the correctness of the extracted data.

vihaanmody1 avatar Aug 13 '22 13:08 vihaanmody1

Thank you for your comment, @Vihaanmody21. tldextract is a reliable library that performs well, and it supports a few more use cases than this library. We extract billions of domains a day for our internal use case, so performance is crucial: our security product has to act on the results in a timely manner.

In our library, we aim to break domains down into their constituent parts, and nothing else.

To address the correctness argument, I would love to get as much data as possible that points to the issues. I will do everything I can to resolve it.

Thank you!

wavenator avatar Aug 16 '22 06:08 wavenator

Hello @wavenator

PyDomainExtractor is a great tool for extracting domains. But if a URL has a scheme in it, it won't work, while tldextract can handle URLs with schemes as well as IPs.

vihaanmody1 avatar Dec 28 '22 22:12 vihaanmody1

I completely agree with your viewpoint. TLDExtract is an excellent library designed to extract domains from diverse sources and data formats, which is not within our scope to address.

Regarding the schemes and IPs, could you please provide some examples of the ones you would like us to extract but are currently not supported? This way, we can keep track of them and consider incorporating them in the future.

wavenator avatar Jun 25 '23 13:06 wavenator

Here is a workaround wrapper function that can handle URLs with or without a scheme, and with or without a port or path (at the cost of slower execution time), similar to tldextract.

import pydomainextractor
import tldextract  # for benchmarks later

pde = pydomainextractor.DomainExtractor()

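def extract(s):
    # Bare domain (no scheme, port, or path): use the plain extractor.
    if pde.is_valid_domain(s):
        return pde.extract(s)
    try:
        # Input that already carries a scheme parses as a URL directly.
        return pde.extract_from_url(s)
    except ValueError:
        # Schemeless input: retry as a scheme-relative URL.
        return pde.extract_from_url("//" + s)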

print(extract("https://a.b.c.example.com.sg"))
print(extract("https://a.b.c.example.com.sg:5000"))
print(extract("https://a.b.c.example.com.sg/path"))
print(extract("https://a.b.c.example.com.sg:5000/path"))

print(extract("a.b.c.example.com.sg"))
print(extract("a.b.c.example.com.sg:5000"))
print(extract("a.b.c.example.com.sg/path"))
print(extract("a.b.c.example.com.sg:5000/path"))

# Each call above prints:
# {'suffix': 'com.sg', 'domain': 'example', 'subdomain': 'a.b.c'}
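
The wrapper above only discovers a missing scheme after `extract_from_url` raises `ValueError`. A possible variant, sketched here, checks for the scheme up front so that each input only needs one URL parse. It assumes that `extract_from_url` accepts scheme-relative input such as `//host:port/path`, which is what the wrapper's fallback already relies on. The helper name `to_url` is hypothetical and not part of pydomainextractor.

```python
import re

# A URL scheme per RFC 3986 (letter, then letters/digits/+/./-) followed by "://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def to_url(s: str) -> str:
    """Return s unchanged if it already starts with a scheme or '//',
    otherwise prefix '//' so it reads as a scheme-relative URL.

    Hypothetical helper: this is a heuristic, not full URL validation.
    """
    if s.startswith("//") or _SCHEME_RE.match(s):
        return s
    return "//" + s


for raw in (
    "https://a.b.c.example.com.sg:5000/path",
    "a.b.c.example.com.sg:5000/path",
    "a.b.c.example.com.sg",
):
    print(to_url(raw))
```

With this in place, the body of the wrapper could in principle shrink to `pde.extract_from_url(to_url(s))`. Whether that is actually faster, and whether bare domains parse correctly this way, has not been measured here.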

Benchmarks

  • A Rust-based parser vastly outperforms a pure Python parser on CPython. However, for handling ambiguous input, tldextract running on PyPy (231 to 338 ns) can match the speed of the pydomainextractor wrapper running on CPython (277 to 631 ns).
  • Note that the wrapper function attempts to parse a.b.c.example.com.sg:5000/a/b/c (schemeless, but with a path) as a URL twice, hence the abnormally slow timing of 631 ns. Further optimization on the Rust side could possibly eliminate this bottleneck.
  • pydomainextractor does not work with IPv4 or IPv6 addresses, while tldextract handles both.
  • pydomainextractor doesn't perform well on PyPy: roughly 3.6 to 3.8 µs per call, compared with 277 to 631 ns on CPython.
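
Since pydomainextractor does not handle IP addresses, one option is to screen out IP literals before calling it. The following is a minimal sketch that uses only the standard library. `host_is_ip` is a hypothetical helper name, and the sketch makes no claim about how tldextract detects IPs internally.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(s: str) -> bool:
    """Return True if the host part of s is an IPv4 or IPv6 literal.

    Hypothetical helper. IPv6 literals must be bracketed, as in URLs;
    a bare, unbracketed IPv6 address is not recognised by this sketch.
    """
    # urlsplit only recognises a host after '//', so add it when missing.
    if "://" not in s and not s.startswith("//"):
        s = "//" + s
    try:
        host = urlsplit(s).hostname  # drops the port and IPv6 brackets
        ipaddress.ip_address(host or "")
    except ValueError:
        return False
    return True


print(host_is_ip("192.168.0.1:8080/path"))
print(host_is_ip("https://[2001:db8::1]:443/"))
print(host_is_ip("a.b.c.example.com.sg:5000/path"))
```

A caller could branch on `host_is_ip(s)` and only hand the remaining inputs to the wrapper function.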
pydomainextractor (via the wrapper function above)
%timeit extract("https://a.b.c.example.com.sg:5000/a/b/c")
%timeit extract("a.b.c.example.com.sg:5000/a/b/c")
%timeit extract("a.b.c.example.com.sg")

# CPython 3.11.7
# 277 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 631 ns ± 8.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 411 ns ± 2.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# PyPy 3.9.18
# 3.63 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 3.84 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 3.56 µs ± 258 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
tldextract
%timeit tldextract.extract("https://a.b.c.example.com.sg:5000/a/b/c")
%timeit tldextract.extract("a.b.c.example.com.sg:5000/a/b/c")
%timeit tldextract.extract("a.b.c.example.com.sg")

# CPython 3.11.7
# 2.13 µs ± 8.73 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 1.79 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 1.78 µs ± 7.46 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# PyPy 3.9.18
# 338 ns ± 1.49 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 231 ns ± 0.647 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 232 ns ± 2.52 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
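
The `%timeit` lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. `ns_per_call` is a hypothetical helper, and `str.lower` is only a stand-in workload so that the block runs without either library installed. Note that this reports the best of several runs rather than the mean ± standard deviation that `%timeit` prints, and the `lambda` adds a small call overhead, so the figures are not directly comparable with the ones above.

```python
import timeit


def ns_per_call(func, arg, number=100_000, repeat=7):
    """Return the best observed time per call of func(arg), in nanoseconds."""
    timer = timeit.Timer(lambda: func(arg))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


# Stand-in workload; swap in `extract` or `tldextract.extract` to compare them.
print(f"{ns_per_call(str.lower, 'A.B.C.EXAMPLE.COM.SG'):.0f} ns per call")
```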

elliotwutingfeng avatar Feb 12 '24 08:02 elliotwutingfeng