python-publicsuffix2
pages.example.com suffix declaration causes get_tld('example.com') == 'example.com'
publicsuffix2 mishandles the case where, given the declaration of some public suffix, all suffixes of that suffix are seen as their own TLDs. E.g., given the declaration of git-pages.rit.edu as a public suffix, get_tld('rit.edu') returns 'rit.edu', whereas it really should return 'edu':
Python 3.7.7 (default, Mar 14 2020, 02:39:38)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from publicsuffix2 import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.get_tld("foo.git-pages.rit.edu")
'git-pages.rit.edu' # CORRECT
>>> psl.get_tld("git-pages.rit.edu")
'git-pages.rit.edu' # WRONG, should be 'edu'
>>> psl.get_tld("rit.edu")
'rit.edu' # WRONG, should be 'edu'
>>> psl.get_tld("edu")
'edu' # CORRECT, but probably out of accident
I tried to understand the _lookup_node method and fix the issue to create a PR, but haven't been successful in the limited time I have right now.
@jmehnle Thanks for the report! @hiratara @KnitCode what's your take on this case? The PSL has this entry:
// Rochester Institute of Technology : http://www.rit.edu/
// Submitted by Jennifer Herting <[email protected]>
git-pages.rit.edu
Just to be clear, this problem is more general than just the git-pages.rit.edu suffix. It will happen with any suffix (here: git-pages.rit.edu) that is an indirect subdomain of another suffix (here: edu): any intermediate domains (here: rit.edu) will erroneously be recognized as their own TLD when really that other suffix (here: edu) should be returned as the TLD instead.
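To make the expected behavior concrete, here is a minimal sketch of longest-match lookup over plain rules only. It is not the library's _lookup_node implementation; the function name and the two-entry rule set are made up for illustration, and wildcard and exception rules are deliberately left out.

```python
def public_suffix_plain(domain, rules):
    """Return the public suffix of `domain` for a set of plain PSL rules.

    Hypothetical helper for illustration only; wildcard ("*.") and
    exception ("!") rules are not handled here.
    """
    labels = domain.lower().strip(".").split(".")
    # Walk candidates from longest to shortest; the first candidate that
    # is itself a declared rule wins. An intermediate domain such as
    # "rit.edu" is never declared, so it falls through to "edu".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    # No rule matched: fall back to the implicit "*" rule (last label).
    return labels[-1]


# Hand-written rule set containing only the entries discussed here.
RULES = {"edu", "git-pages.rit.edu"}

print(public_suffix_plain("foo.git-pages.rit.edu", RULES))  # git-pages.rit.edu
print(public_suffix_plain("rit.edu", RULES))                # edu
print(public_suffix_plain("edu", RULES))                    # edu
```

The point is that only names that actually appear as rules can be returned; the parents of a multi-label rule do not become suffixes just because a longer rule passes through them.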
I think it's my fault. I made the same mistake with the rust library.
https://github.com/rushmorem/publicsuffix/issues/24
$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from publicsuffix2 import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.get_tld("cdn.fbsbx.com")
'fbsbx.com' # WRONG
>>> psl.get_tld("git-pages.rit.edu")
'git-pages.rit.edu' # WRONG, should be 'edu'
I believe this behavior is correct. psl produces the same result.
$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import psl
>>> psl.domain_suffixes("git-pages.rit.edu").public
'git-pages.rit.edu'
>>> psl.domain_suffixes("rit.edu").public
'edu'
>>> psl.domain_suffixes("edu").public
'edu'
>>> psl.domain_suffixes("cdn.fbsbx.com").public
'com'
We also have to consider the platform.sh problem.
This ticket insists that the public suffix of rit.edu should be edu, and I agree. So what should the public suffix of kobe.jp be? Our test insists that it should be "kobe.jp".
Here is the result of psl:
>>> psl.domain_suffixes("kobe.jp").public
'jp'
>>> psl.domain_suffixes("x.kobe.jp").public
'x.kobe.jp'
>>> psl.domain_suffixes("city.kobe.jp").public
'kobe.jp'
I think it's a good idea to produce the same results as psl.
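For reference, the kobe.jp results above are what the PSL matching rules give once wildcard and exception rules are taken into account. Below is a rough sketch, assuming the rule set is just jp, *.kobe.jp and !city.kobe.jp; the function name is hypothetical and this is neither psl's nor publicsuffix2's actual code.

```python
def public_suffix(domain, rules):
    """Sketch of PSL matching with plain, wildcard and exception rules.

    Hypothetical helper for illustration; `rules` is a set of rule
    strings written as they appear in the list ("jp", "*.kobe.jp",
    "!city.kobe.jp").
    """
    labels = domain.lower().strip(".").split(".")
    # All suffixes of the domain, longest first.
    suffixes = [".".join(labels[i:]) for i in range(len(labels))]

    # 1. A matching exception rule beats every other rule; the public
    #    suffix is the exception rule minus its leftmost label.
    for s in suffixes:
        if "!" + s in rules:
            return s.partition(".")[2]

    # 2. Otherwise the matching rule with the most labels wins.
    for i, s in enumerate(suffixes):
        if s in rules:
            return s
        parent = ".".join(labels[i + 1:])
        if parent and "*." + parent in rules:
            return s

    # 3. Nothing matched: implicit "*" rule, i.e. the last label.
    return labels[-1]


# Assumed rule set, limited to the entries relevant to this example.
RULES = {"jp", "*.kobe.jp", "!city.kobe.jp"}

print(public_suffix("kobe.jp", RULES))       # jp
print(public_suffix("x.kobe.jp", RULES))     # x.kobe.jp
print(public_suffix("city.kobe.jp", RULES))  # kobe.jp
```

Note that kobe.jp itself matches neither the wildcard (which needs one more label) nor any plain rule, so it falls back to jp, which is why a test expecting "kobe.jp" disagrees with psl.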
I'm trying to fix the issue with https://github.com/nexB/python-publicsuffix2/pull/19 .