PyDomainExtractor extract_from_url should handle url without protocol-scheme

Open nsteinberg-r7 opened this issue 4 years ago • 3 comments

How to reproduce: call extract_from_url with //mail.google.com/mail as input

The result is an Invalid Domain error. The expected behavior is to handle the missing protocol and return {subdomain: mail, domain: google, suffix: com}.

Sep 16 '20 09:09 nsteinberg-r7

Technically, this is not a valid URI but a URI reference: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#URI_references

We can support it, but only when the input starts with //, since that prefix is what lets us distinguish a valid URI reference from an invalid URI.
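To illustrate why the leading // matters, here is a minimal sketch using Python's standard urllib.parse rather than PyDomainExtractor itself. The helper name host_of is hypothetical, and this is not a claim about how the library parses input internally; it only shows that a generic URL parser treats the text after // as the host, while the same text without // is treated as a path.

```python
from urllib.parse import urlsplit


def host_of(reference: str):
    """Return the host of a URL or URI reference, or None if it has no authority part.

    host_of is a hypothetical helper for illustration only.
    """
    # urlsplit only fills in netloc (and therefore hostname) when the
    # input contains "//", with or without a scheme in front of it.
    return urlsplit(reference).hostname


print(host_of("https://mail.google.com/mail"))  # mail.google.com
print(host_of("//mail.google.com/mail"))        # mail.google.com
print(host_of("mail.google.com/mail"))          # None (parsed as a path)
```

Without the // prefix, the parser cannot tell whether mail.google.com/mail is a host followed by a path or just a relative path, which is the ambiguity described above.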

Sep 22 '20 13:09 wavenator

Tldextract can extract with schemes

Dec 28 '22 22:12 vihaanmody1

It appears that extract_from_url("//mail.google.com/mail") now works.

Feb 12 '24 08:02 elliotwutingfeng
