PyDomainExtractor

extract_from_url should handle URLs without a protocol scheme

Open nsteinberg-r7 opened this issue 4 years ago • 3 comments

How to reproduce: call extract_from_url with //mail.google.com/mail as input

The result is an Invalid Domain error. The expected behavior is to handle the missing protocol and return {subdomain: mail, domain: google, suffix: com}.

nsteinberg-r7 • Sep 16 '20 09:09

Technically this is not a valid URI but a URI reference. https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#URI_references

We can support it, but only with a // at the beginning, so that we can distinguish between valid and invalid URIs.
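To illustrate why the leading // matters, here is a minimal sketch using Python's standard `urllib.parse.urlsplit`. It is not PyDomainExtractor's implementation, and `host_of` is a hypothetical helper introduced only for this example. It shows that a generic URI parser finds a host in a network-path reference such as //mail.google.com/mail, but not in the same string without the leading slashes.

```python
from urllib.parse import urlsplit


def host_of(reference):
    """Return the host of a URI or network-path reference, or None if it has none.

    Hypothetical helper for illustration; not part of PyDomainExtractor.
    """
    # urlsplit only populates the network location when "//" follows the
    # scheme or starts the input; otherwise the whole string is a path.
    return urlsplit(reference).hostname


print(host_of("https://mail.google.com/mail"))  # mail.google.com
print(host_of("//mail.google.com/mail"))        # mail.google.com
print(host_of("mail.google.com/mail"))          # None
```

Without the //, the parser treats mail.google.com/mail as a relative path, so there is no host from which to extract a subdomain, domain, and suffix.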

wavenator • Sep 22 '20 13:09

Tldextract can extract with schemes

[image]

vihaanmody1 • Dec 28 '22 22:12

It appears that extract_from_url("//mail.google.com/mail") now works.

elliotwutingfeng • Feb 12 '24 08:02