Categorize domains by content
People using the 'External' tab are most likely trying to discover what type of content people usually link to. I think the linked domains should be categorized, probably with a website classification API such as Klazify. The results could be displayed as another bar chart or a pie chart. However, some people may consider using a website classification API a privacy issue. It might be possible to do this offline using a dataset or a machine learning model, but I have not been able to find one that is available for free.
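To make the proposal a bit more concrete, here is a minimal sketch of the aggregation step that would feed a bar or pie chart. It assumes we already have some domain-to-category mapping; `DOMAIN_CATEGORIES`, `hostname`, and `category_counts` are hypothetical names for illustration, and the category labels are made up rather than taken from Klazify or any other service.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical mapping for illustration only. A real one would come from
# a classification API or an offline dataset.
DOMAIN_CATEGORIES = {
    "youtube.com": "Video",
    "twitch.tv": "Video",
    "github.com": "Technology",
}


def hostname(url):
    """Return the lowercased hostname of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def category_counts(urls, categories=DOMAIN_CATEGORIES, fallback="Unknown"):
    """Count links per category, most frequent first (ready for a chart)."""
    counts = Counter()
    for url in urls:
        counts[categories.get(hostname(url), fallback)] += 1
    return counts.most_common()


links = [
    "https://www.youtube.com/watch?v=1",
    "https://twitch.tv/x",
    "https://github.com/a/b",
    "https://example.org/",
]
for category, count in category_counts(links):
    print(category, count)
```

Domains missing from the mapping fall into a single "Unknown" bucket, so the chart still accounts for every link.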
I like this idea! It also seems like a good way to fill the External tab, since there is only one card 👍
Someday we may find a dataset, maybe taking the top 1000 Alexa Rank pages (RIP)
After researching the Alexa rankings a bit, I found out that Cloudflare made a replacement for it which actually lets you download the data on the top 1,000,000 domains along with their classification, 72.4% of which are human-classified. The classifications also seem specific enough to be useful to users while still being broad enough to group several domains together. This looks perfect!
Only top 100 with classification unfortunately 😞, we'll have to keep looking I guess
This is public data, and I think it provides categorization, but it requires AWS to query, so I have not been able to test it yet: https://commoncrawl.org/
I could not find domains categorized here
I found this dataset of ~1.5 million websites (GitHub page for the code used). It's a bit outdated (2020), it was labeled by an NLP model rather than by humans, and it classifies sites based only on their URLs. I'll keep looking for something better, but for now this is the best I could find.
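If we go the offline route with a dataset like this, the lookup could look roughly like the sketch below. I have not checked the actual file format, so the `domain` and `category` CSV columns are an assumption, and `load_categories` and `lookup` are hypothetical helper names.

```python
import csv
import io


def load_categories(csv_file):
    """Build a domain -> category dict from a CSV with 'domain' and
    'category' columns (assumed layout, not the dataset's real schema)."""
    reader = csv.DictReader(csv_file)
    return {
        row["domain"].strip().lower(): row["category"].strip()
        for row in reader
    }


def lookup(host, categories):
    """Look up a hostname, falling back to its parent domains.

    'm.youtube.com' is tried first, then 'youtube.com'. The bare TLD is
    never tried. Multi-part suffixes such as 'co.uk' are not special-cased.
    """
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in categories:
            return categories[candidate]
    return None


# Tiny in-memory stand-in for the real dataset file.
sample = io.StringIO("domain,category\nyoutube.com,Video\ngithub.com,Technology\n")
categories = load_categories(sample)
print(lookup("m.youtube.com", categories))
print(lookup("example.org", categories))
```

Since everything happens locally, this would also avoid the privacy concern of sending domains to a third-party API, at the cost of shipping (or downloading) the dataset.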
For Cloudflare, we could try the API endpoints, specifically this one, which gives details about website categories.