TLDExtract
Parser bug when subdomain has "-"
The parser fails for the following:
// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';
// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);
parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null
I don't believe this is an issue with hyphens; it's an issue with S3 domains.
s3-ap-southeast-2.amazonaws.com is listed in the private domains section of the Public Suffix List: https://github.com/publicsuffix/list/blob/master/public_suffix_list.dat#L10747
Once you parse the S3 domain you end up with:
subdomain: null
hostname: s3-ap-southeast-2.amazonaws.com
suffix: null
So you could use $domainParser->getHostname() to get s3-ap-southeast-2.amazonaws.com back.
If you don't care about private domains, you can do this:
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';
// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract(null, null, \LayerShifter\TLDExtract\Extract::MODE_ALLOW_ICCAN);
$domainParser = $extract->parse($url);
$domainParser->getSubdomain(); // s3-ap-southeast-2
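To illustrate why the private entry changes the result, here is a minimal, self-contained sketch of longest-suffix matching. It is not TLDExtract's implementation: the function splitHost() and the two tiny suffix arrays are hypothetical stand-ins for the ICANN and private sections of the Public Suffix List. The toy reports the matched host as the suffix, whereas the library (per the output above) reports it as the hostname with a null suffix; in both cases nothing is left over to be a subdomain.

```php
<?php
// Hypothetical, heavily abbreviated stand-ins for the two PSL sections.
const ICANN_SUFFIXES = ['com'];
const PRIVATE_SUFFIXES = ['s3-ap-southeast-2.amazonaws.com'];

/**
 * Split a host into subdomain, name and suffix by looking for the
 * longest entry in $suffixes that the host ends with.
 */
function splitHost(string $host, array $suffixes): array
{
    $labels = explode('.', strtolower($host));

    // Candidates go from the whole host down to the last label,
    // so the first hit is the longest matching suffix.
    for ($i = 0; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            $rest = array_slice($labels, 0, $i);
            // Label directly left of the suffix; null if nothing is left.
            $name = array_pop($rest);
            return [
                'subdomain' => $rest ? implode('.', $rest) : null,
                'name'      => $name,
                'suffix'    => $candidate,
            ];
        }
    }

    return ['subdomain' => null, 'name' => null, 'suffix' => null];
}

$host = 's3-ap-southeast-2.amazonaws.com';

// Private entries included: the whole host is itself a suffix.
$withPrivate = splitHost($host, array_merge(ICANN_SUFFIXES, PRIVATE_SUFFIXES));

// ICANN entries only: "com" is the suffix, the rest splits normally.
$icannOnly = splitHost($host, ICANN_SUFFIXES);

var_dump($withPrivate['subdomain']); // NULL
var_dump($icannOnly['subdomain']);   // string(17) "s3-ap-southeast-2"
```

With the private entry in play, the match consumes every label, so the subdomain is null regardless of the hyphens. Dropping the private entries leaves "com" as the suffix and "s3-ap-southeast-2" as the subdomain, which lines up with the ICCAN-mode result above.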