
Parser bug when subdomain has "-"

Open NizarBlond opened this issue 6 years ago • 1 comment

The parser fails for the following:

// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';

// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);

parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null 

NizarBlond, Mar 01 '19 16:03

I don't believe this is an issue with hyphens; it's an issue with S3 domains.

s3-ap-southeast-2.amazonaws.com is itself defined as a private domain in the Public Suffix List - https://github.com/publicsuffix/list/blob/master/public_suffix_list.dat#L10747

Once you parse the S3 domain you end up with:

subdomain: null
hostname: s3-ap-southeast-2.amazonaws.com
suffix: null

So you could use $domainParser->getHostname().
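To make the result above easier to follow, here is a minimal sketch of longest-suffix matching. It is a toy illustration, not TLDExtract's actual code: `splitHost()` and the two small suffix arrays are hypothetical stand-ins for the parser and the Public Suffix List. It shows why a host that is itself a listed suffix leaves nothing to split off, while an ICANN-only list yields a subdomain.

```php
<?php
// Toy illustration only: NOT TLDExtract's implementation. The suffix arrays
// below are tiny hand-picked stand-ins for the Public Suffix List.

/**
 * Split a host into [subdomain, hostname, suffix] using the longest matching suffix.
 */
function splitHost(string $host, array $suffixes): array
{
    $labels = explode('.', $host);
    $count = count($labels);

    // Try the longest candidate first (the whole host), then drop labels from the left.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (!in_array($candidate, $suffixes, true)) {
            continue;
        }
        if ($i === 0) {
            // The whole host is itself a suffix, so there is nothing left to split off.
            return [null, $host, null];
        }
        $hostname = $labels[$i - 1];
        $subdomain = $i > 1 ? implode('.', array_slice($labels, 0, $i - 1)) : null;
        return [$subdomain, $hostname, $candidate];
    }

    // No known suffix matched at all.
    return [null, $host, null];
}

$host = 's3-ap-southeast-2.amazonaws.com';
$icannOnly = ['com'];                                           // assumed ICANN-only list
$icannAndPrivate = ['com', 's3-ap-southeast-2.amazonaws.com'];  // assumed list with the private entry

$withPrivate = splitHost($host, $icannAndPrivate);
$withoutPrivate = splitHost($host, $icannOnly);
```

With the private entry present, the whole host matches at the first step and only the hostname slot is filled. With the ICANN-only list, `com` is the longest match, so `amazonaws` becomes the hostname and `s3-ap-southeast-2` the subdomain.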

If you don't care about private domains you can do this:

$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';

// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract(null, null, \LayerShifter\TLDExtract\Extract::MODE_ALLOW_ICANN);
$domainParser = $extract->parse($url);

$domainParser->getSubdomain(); // s3-ap-southeast-2 

jkns, Mar 13 '19 10:03