python-publicsuffix2

pages.example.com suffix declaration causes get_tld('example.com') == 'example.com'

Open jmehnle opened this issue 4 years ago • 6 comments

publicsuffix2 mishandles multi-label public suffix declarations: once some public suffix is declared, every parent domain of that suffix is also treated as its own TLD. E.g., given the declaration of git-pages.rit.edu as a public suffix, get_tld('rit.edu') returns 'rit.edu', whereas it really should return 'edu':

Python 3.7.7 (default, Mar 14 2020, 02:39:38)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from publicsuffix2 import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.get_tld("foo.git-pages.rit.edu")
'git-pages.rit.edu'  # CORRECT
>>> psl.get_tld("git-pages.rit.edu")
'git-pages.rit.edu'  # WRONG, should be 'edu'
>>> psl.get_tld("rit.edu")
'rit.edu'            # WRONG, should be 'edu'
>>> psl.get_tld("edu")
'edu'                # CORRECT, but probably by accident

jmehnle, Jul 15 '20 17:07

I tried to understand the _lookup_node method and fix the issue to create a PR, but haven't been successful in the limited time I have right now.

jmehnle, Jul 15 '20 18:07

@jmehnle Thanks for the report! @hiratara @KnitCode, what's your take on this case? The PSL has this entry:

// Rochester Institute of Technology : http://www.rit.edu/
// Submitted by Jennifer Herting <[email protected]>
git-pages.rit.edu

pombredanne, Jul 16 '20 12:07

Just to be clear, this problem is more general than just the git-pages.rit.edu suffix. It will happen with any suffix (here: git-pages.rit.edu) that is an indirect subdomain of another suffix (here: edu): any intermediate domains (here: rit.edu) will erroneously be recognized as their own TLD when really that other suffix (here: edu) should be returned as the TLD instead.
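
To make the expected behavior concrete, here is a minimal sketch of a longest-matching-rule lookup. It is not the publicsuffix2 implementation: the function name `public_suffix` and the two-entry `RULES` set are hypothetical, and wildcard and exception rules are ignored. The point is only that a name counts as a suffix when it was explicitly declared, never because it happens to be a parent of a declared suffix.

```python
# Hypothetical rule set: only these two names are declared as public suffixes.
RULES = {"edu", "git-pages.rit.edu"}


def public_suffix(domain, rules=RULES):
    """Return the longest declared rule that matches the end of `domain`."""
    labels = domain.lower().strip(".").split(".")
    # Try candidates from longest to shortest, e.g. a.b.c, then b.c, then c.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    # No declared rule matched: fall back to the last label.
    return labels[-1]


for name in ("foo.git-pages.rit.edu", "rit.edu", "edu"):
    print(name, "->", public_suffix(name))
```

With this rule set, rit.edu resolves to edu because rit.edu itself was never declared, even though it is a parent of git-pages.rit.edu.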

jmehnle, Jul 16 '20 12:07

I think it's my fault. I made the same mistake with the Rust library.

https://github.com/rushmorem/publicsuffix/issues/24

$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from publicsuffix2 import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.get_tld("cdn.fbsbx.com")
'fbsbx.com'    # WRONG

>>> psl.get_tld("git-pages.rit.edu")
'git-pages.rit.edu'  # WRONG, should be 'edu'

I believe this behavior is correct. psl produces the same result.

$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import psl
>>> psl.domain_suffixes("git-pages.rit.edu").public
'git-pages.rit.edu'
>>> psl.domain_suffixes("rit.edu").public
'edu'
>>> psl.domain_suffixes("edu").public
'edu'
>>> psl.domain_suffixes("cdn.fbsbx.com").public
'com'

hiratara, Jul 17 '20 07:07

We also have to consider the platform.sh problem.

This ticket insists that the public suffix of rit.edu should be edu, and I agree. So what should the public suffix of kobe.jp be? Our test insists that it should be "kobe.jp".

Here is the result of psl:

>>> psl.domain_suffixes("kobe.jp").public
'jp'
>>> psl.domain_suffixes("x.kobe.jp").public
'x.kobe.jp'
>>> psl.domain_suffixes("city.kobe.jp").public
'kobe.jp'

I think it's a good idea to produce the same results as psl.
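
The kobe.jp results above follow from the wildcard and exception rules the PSL declares for that domain (`*.kobe.jp` and `!city.kobe.jp`). Below is a simplified sketch of how those rule types interact. It is neither the psl nor the publicsuffix2 implementation: the function name `public_suffix_with_wildcards` and the three-entry `RULES` set are hypothetical, and the candidate-by-candidate check is a shortcut rather than the full matching algorithm.

```python
# Hypothetical subset of the PSL rules relevant to kobe.jp.
RULES = {"jp", "*.kobe.jp", "!city.kobe.jp"}


def public_suffix_with_wildcards(domain, rules=RULES):
    """Simplified lookup that handles plain, wildcard, and exception rules."""
    labels = domain.lower().strip(".").split(".")
    # Walk candidates from longest to shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        # Exception rule: the suffix is the rule minus its leftmost label.
        if "!" + candidate in rules:
            return ".".join(labels[i + 1:])
        # Plain rule: the candidate itself is the suffix.
        if candidate in rules:
            return candidate
        # Wildcard rule: "*" stands in for the candidate's leftmost label.
        wildcard = ".".join(["*"] + labels[i + 1:])
        if wildcard in rules:
            return candidate
    # No rule matched: fall back to the last label.
    return labels[-1]


for name in ("kobe.jp", "x.kobe.jp", "city.kobe.jp"):
    print(name, "->", public_suffix_with_wildcards(name))
```

Under these assumptions kobe.jp maps to jp, because the wildcard only covers names one label below kobe.jp and kobe.jp itself is not declared.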

hiratara, Jul 18 '20 08:07

I'm trying to fix the issue with https://github.com/nexB/python-publicsuffix2/pull/19.

hiratara, Jul 19 '20 12:07