PyDomainExtractor
extract_from_url should handle url without protocol-scheme
How to reproduce: call extract_from_url with //mail.google.com/mail as input. The result is an Invalid Domain error.
Expected behavior: handle the missing protocol scheme and return {subdomain: mail, domain: google, suffix: com}.
Technically this is not a valid URI but a URI reference. https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#URI_references
We can support it, but only when the input starts with //, which is what lets us tell a scheme-less URI reference apart from an invalid URI.
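To illustrate why the leading // matters, here is a minimal sketch using Python's standard urllib.parse rather than PyDomainExtractor itself. The helper name host_of is made up for this example. It shows that a standard parser only recognizes a host when the reference has a scheme or begins with //; without either, the whole string is treated as a path.

```python
from urllib.parse import urlsplit


def host_of(reference: str) -> str:
    """Return the authority (host) part of a URI reference, or '' if it has none.

    host_of is a hypothetical helper for illustration, not part of PyDomainExtractor.
    """
    return urlsplit(reference).netloc


# Full URI with a scheme: the host is recognized.
print(repr(host_of("https://mail.google.com/mail")))

# Network-path reference (leading //): the host is still recognized.
print(repr(host_of("//mail.google.com/mail")))

# No scheme and no leading //: the parser sees only a relative path, so no host.
print(repr(host_of("mail.google.com/mail")))
```

This is only an analogy for the behavior discussed above; how PyDomainExtractor parses its input internally may differ.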
Tldextract can extract with schemes
It appears that extract_from_url("//mail.google.com/mail") now works.