mantis Mantis skips discovery phase for TLDs that are reserved for government entities

Describe the bug

Mantis seems to skip the discovery phase for TLDs reserved for country/government entities.

To Reproduce

mantis onboard -o gov:my -t gov.my

[2024-09-04 15:36:36,773] --> INFO: MANTIS Workflow - STARTED
[2024-09-04 15:36:36,773] --> INFO: Executing workname workflowName='default' schedule='daily between 00:00 and 04:00' cmd=[] scanNewOnly=False workflowConfig=[Module(moduleName='discovery', tools=['Subfinder', 'Amass'], order=1), Module(moduleName='prerecon', tools=['FindCDN', 'Naabu'], order=2), Module(moduleName='activehostscan', tools=['HTTPX_Active', 'HTTPX'], order=3), Module(moduleName='activerecon', tools=['Wafw00f'], order=4), Module(moduleName='scan', tools=['DNSTwister', 'Nuclei', 'Corsy'], order=5), Module(moduleName='secretscanner', tools=['SecretScanner'], order=6)]
[2024-09-04 15:36:36,793] --> INFO: Inserting user input into database

0it [00:00, ?it/s]

PRERECON: 100%|

ACTIVEHOSTSCAN: 100%|

System (please complete the following information):

Docker based setup on Ubuntu 24.04.

Additional context

This seems to happen due to the library that is used to categorize the input provided.

Sep 04 '24 15:09 0xbharath

The issue seems to be in how the tldextract library is used in the file mantis/utils/asset_type.py.

>>> import tldextract
>>> tldextract.extract("example.com").registered_domain
'example.com'
>>> tldextract.extract("nic.in").registered_domain
''

tldextract uses the Public Suffix List to parse TLDs: https://publicsuffix.org/list/public_suffix_list.dat
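
To illustrate why a suffix-only input comes back empty, here is a minimal sketch of longest-suffix matching. It is not tldextract's implementation: the `SUFFIXES` set is a small hand-picked subset assumed for illustration, the `registered_domain` helper is hypothetical, and the wildcard and exception rules of the real list are ignored.

```python
# Illustrative subset of public suffixes (assumption); the real list is far
# larger and also has wildcard and exception rules that this sketch ignores.
SUFFIXES = {"com", "in", "nic.in", "my", "gov.my"}


def registered_domain(name: str) -> str:
    """Return the longest matching suffix plus one label, or '' if no label is left."""
    labels = name.lower().strip(".").split(".")
    # Candidates shrink from the left, so the longest suffix is tried first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            # i == 0 means the whole input is itself a public suffix.
            return ".".join(labels[i - 1:]) if i > 0 else ""
    return ""


for name in ("example.com", "nic.in", "gov.my", "example.nic.in"):
    print(f"{name!r} -> {registered_domain(name)!r}")
```

Under these assumptions, any input that is itself a listed suffix has no label left over to form a registered domain, which matches the empty string seen for nic.in.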

Sep 04 '24 16:09 0xbharath

Shouldn't this issue be fixed at the source?

Oct 03 '24 15:10 dmdhrumilmistry

Ideally, yes. It would be tricky to get the library to incorporate these changes. We are trying to find a workaround or switch to a different library to fix this issue.

Oct 05 '24 07:10 0xbharath

After thinking about it, I don't think there is anything wrong with the library. nic.in is supposed to be used as a TLD, so if you use the library to extract the registered domain from a string consisting only of a TLD, it should return an empty string.

>>> import tldextract
# querying str with TLD only
>>> tldextract.extract("com").registered_domain
''
>>> tldextract.extract("nic.in").registered_domain
''

# querying str with labels + tld
>>> tldextract.extract("example.com").registered_domain
'example.com'
>>> tldextract.extract("subdomain.example.com").registered_domain
'example.com'
>>> tldextract.extract("example.nic.in").registered_domain # works since it has label + TLD
'example.nic.in'
>>> tldextract.extract("subdomain.example.nic.in").registered_domain
'example.nic.in'
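
One possible workaround on the caller's side, sketched under assumptions: `Parts` is a hypothetical stand-in for the subdomain/domain/suffix fields a suffix parser returns, and `categorize` is not Mantis code. It assumes the parser reports a suffix-only input as an empty domain with the whole string in the suffix field, and it keeps that input instead of dropping it.

```python
from typing import NamedTuple, Tuple


class Parts(NamedTuple):
    """Hypothetical stand-in for the fields a suffix parser returns."""
    subdomain: str
    domain: str
    suffix: str


def categorize(parts: Parts) -> Tuple[str, str]:
    """Return (category, name to scan); suffix-only input is kept, not dropped."""
    if parts.suffix and not parts.domain:
        # Input such as "nic.in" or "gov.my": the whole string is a public suffix.
        return "suffix_only", parts.suffix
    if parts.domain and parts.suffix:
        root = f"{parts.domain}.{parts.suffix}"
        if parts.subdomain:
            return "subdomain", f"{parts.subdomain}.{root}"
        return "registered_domain", root
    # Nothing recognisable, e.g. an empty string or an unknown suffix.
    return "unknown", ""


print(categorize(Parts("", "", "gov.my")))
print(categorize(Parts("", "example", "nic.in")))
```

Whether a suffix-only name should then go through discovery is a design decision for the tool; the sketch only shows that the input can be told apart from an invalid one.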

Oct 05 '24 07:10 dmdhrumilmistry

@0xbharath can you provide a real-world example scenario? I'll take a look into this.

Oct 21 '24 07:10 dmdhrumilmistry