Mantis skips discovery phase for TLDs that are reserved for government entities
Describe the bug
Mantis seems to skip discovery phase for TLDs reserved for country/govt entities.
To Reproduce
mantis onboard -o gov:my -t gov.my
[2024-09-04 15:36:36,773] --> INFO: MANTIS Workflow - STARTED
[2024-09-04 15:36:36,773] --> INFO: Executing workname workflowName='default' schedule='daily between 00:00 and 04:00' cmd=[] scanNewOnly=False workflowConfig=[Module(moduleName='discovery', tools=['Subfinder', 'Amass'], order=1), Module(moduleName='prerecon', tools=['FindCDN', 'Naabu'], order=2), Module(moduleName='activehostscan', tools=['HTTPX_Active', 'HTTPX'], order=3), Module(moduleName='activerecon', tools=['Wafw00f'], order=4), Module(moduleName='scan', tools=['DNSTwister', 'Nuclei', 'Corsy'], order=5), Module(moduleName='secretscanner', tools=['SecretScanner'], order=6)]
[2024-09-04 15:36:36,793] --> INFO: Inserting user input into database
0it [00:00, ?it/s]
PRERECON: 100%|
ACTIVEHOSTSCAN: 100%|
System (please complete the following information):
Docker based setup on Ubuntu 24.04.
Additional context
This seems to happen because of the library used to categorize the provided input.
The issue appears to be in how the tldextract library is used in mantis/utils/asset_type.py. For an input that is itself a public suffix, registered_domain comes back empty:
>>> tldextract.extract("example.com").registered_domain
'example.com'
>>> tldextract.extract("nic.in").registered_domain
''
tldextract uses the public suffix list for parsing TLDs https://publicsuffix.org/list/public_suffix_list.dat
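To illustrate why a bare suffix yields an empty registered domain, here is a minimal sketch of longest-suffix matching. It is not tldextract's actual implementation: `SAMPLE_SUFFIXES` is a hand-picked subset assumed for illustration, `registered_domain` is a hypothetical helper, and the real list also has wildcard and exception rules that are ignored here.

```python
# Hand-picked subset of the public suffix list, for illustration only.
SAMPLE_SUFFIXES = frozenset({"com", "in", "nic.in", "my", "gov.my"})


def registered_domain(host, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label to its left.

    Returns '' when the input is itself a suffix or matches no suffix.
    """
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The whole input is a suffix, so no label is left over.
                return ""
            return ".".join(labels[i - 1:])
    return ""


print(repr(registered_domain("gov.my")))          # bare suffix
print(repr(registered_domain("example.nic.in")))  # label + suffix
```

With this subset, `gov.my` and `nic.in` match a suffix in full and leave no label, which mirrors the empty strings in the session above.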
Shouldn't this issue be fixed at the source?
Ideally, yes, but it would be tricky to get the library to incorporate these changes. We are trying to see if we can find a workaround or use a different library to fix this issue.
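One possible shape for such a workaround, as a rough sketch only: fall back to treating a bare public suffix as a domain, so that discovery still runs for inputs like gov.my. The `Parts` tuple stands in for the subdomain, domain and suffix parts that tldextract reports; `classify_asset` and its return labels are hypothetical and do not reflect the actual code in mantis/utils/asset_type.py.

```python
from collections import namedtuple

# Stand-in for the three parts reported by tldextract, so the sketch
# runs without the library installed.
Parts = namedtuple("Parts", ["subdomain", "domain", "suffix"])


def classify_asset(parts):
    """Hypothetical classifier; the labels are assumptions for illustration."""
    if parts.domain and parts.subdomain:
        return "subdomain"
    if parts.domain:
        return "domain"
    if parts.suffix:
        # Bare public suffix (e.g. gov.my): no registered domain exists,
        # so fall back to treating the suffix itself as the domain.
        return "domain"
    return "unknown"


print(classify_asset(Parts("", "", "gov.my")))
print(classify_asset(Parts("www", "example", "com")))
```

Whether scanning an entire public suffix is desirable is a separate question; the fallback only keeps such input from being dropped silently.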
After thinking about it, I don't think there's anything wrong with the library. nic.in is on the public suffix list, so it is treated as a TLD. If you use the library to extract the registered domain from a string consisting only of a TLD, it should return an empty string.
>>> import tldextract
# querying str with TLD only
>>> tldextract.extract("com").registered_domain
''
>>> tldextract.extract("nic.in").registered_domain
''
# querying str with labels + tld
>>> tldextract.extract("example.com").registered_domain
'example.com'
>>> tldextract.extract("subdomain.example.com").registered_domain
'example.com'
>>> tldextract.extract("example.nic.in").registered_domain # works since it has label + TLD
'example.nic.in'
>>> tldextract.extract("subdomain.example.nic.in").registered_domain
'example.nic.in'
@0xbharath can you provide a real-world example? I'll take a look into this.