
Categorize domains by content

Open hopperelec opened this issue 2 years ago • 7 comments

People using the 'External' tab are most likely trying to discover what type of content people usually link to. I think the linked domains should be categorized, likely using some sort of website classification API such as Klazify. This information could be displayed as another bar chart or a pie chart. However, some people may consider sending domains to a website classification API a privacy issue. It might be possible to do this offline using a dataset or machine learning model, but I have not been able to find one available for free.
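To make the offline variant concrete, here is a minimal sketch, assuming a domain-to-category table is already available locally. The table entries, the `Uncategorized` fallback, and the function names are all hypothetical and not part of chat-analytics. It only shows how per-domain link counts could be rolled up into per-category totals for a bar or pie chart.

```typescript
// Hypothetical lookup table: domain -> category. In practice this would come
// from an offline dataset; these entries are made up for illustration.
const DOMAIN_CATEGORIES: Map<string, string> = new Map<string, string>([
    ["youtube.com", "Video"],
    ["twitch.tv", "Video"],
    ["github.com", "Technology"],
    ["wikipedia.org", "Reference"],
]);

// Assumed fallback label for domains missing from the table.
const UNKNOWN_CATEGORY = "Uncategorized";

// Look up a domain, falling back to its parent domains
// (e.g. "en.wikipedia.org" -> "wikipedia.org"). The bare TLD is never tried.
function categorizeDomain(domain: string, table: Map<string, string>): string {
    const labels = domain.toLowerCase().split(".");
    for (let i = 0; i < labels.length - 1; i++) {
        const candidate = labels.slice(i).join(".");
        const category = table.get(candidate);
        if (category !== undefined) return category;
    }
    return UNKNOWN_CATEGORY;
}

// Sum link counts per category and sort descending by count.
function countByCategory(
    domainCounts: Map<string, number>,
    table: Map<string, string>
): [string, number][] {
    const totals = new Map<string, number>();
    domainCounts.forEach((count, domain) => {
        const category = categorizeDomain(domain, table);
        totals.set(category, (totals.get(category) ?? 0) + count);
    });
    return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}

// Made-up per-domain link counts, standing in for the 'External' tab data.
const exampleCounts = new Map<string, number>([
    ["youtube.com", 120],
    ["en.wikipedia.org", 30],
    ["twitch.tv", 15],
    ["example.invalid", 4],
]);
const categoryTotals = countByCategory(exampleCounts, DOMAIN_CATEGORIES);
```

Because the lookup runs entirely on a local table, no domain ever leaves the user's machine, which sidesteps the privacy concern. The cost is that anything missing from the table ends up in the fallback bucket.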

hopperelec avatar Dec 31 '22 03:12 hopperelec

I like this idea! And seems a good way to fill the External tab since there is only one card 👍

Someday we may find a dataset, maybe taking the top 1000 Alexa Rank pages (RIP)

mlomb avatar Jan 09 '23 16:01 mlomb

After researching the Alexa rankings a bit, I found out that Cloudflare made a replacement for it, which actually lets you download the data on the top 1,000,000 domains along with their classification, 72.4% of which are human-classified. The classifications also seem specific enough to be useful to users while also being vague enough to have several domains categorized together. This looks perfect!

hopperelec avatar Jan 09 '23 18:01 hopperelec

Only top 100 with classification unfortunately 😞, we'll have to keep looking I guess

mlomb avatar Jan 10 '23 19:01 mlomb

This is public data and I think it provides categorization but it requires AWS to query so I've not been able to test it yet https://commoncrawl.org/

hopperelec avatar Jan 10 '23 20:01 hopperelec

> This is public data and I think it provides categorization but it requires AWS to query so I've not been able to test it yet https://commoncrawl.org/

I could not find domains categorized here

mlomb avatar Jan 10 '23 20:01 mlomb

I found this dataset of ~1.5 million websites (GitHub page for the code used). It's a bit outdated (2020), it was classified by an NLP model rather than by humans, and it classifies based on URLs only. I'll try to see if I can find something better, but for now this is the best I could find.
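For reference, a rough sketch of how a dataset like this could be turned into a domain lookup table. The two-column `url,category` layout with a header row and no quoted fields is an assumption, since I have not checked the actual file format, and `normalizeDomain` and `parseCategoryCsv` are hypothetical helper names.

```typescript
// Reduce a URL or bare host to a lowercase domain without scheme, path, port, or "www.".
function normalizeDomain(urlOrHost: string): string {
    return urlOrHost
        .trim()
        .toLowerCase()
        .replace(/^[a-z][a-z0-9+.-]*:\/\//, "") // drop a scheme such as "https://"
        .replace(/[/?#].*$/, "")                 // drop path, query, and fragment
        .replace(/:\d+$/, "")                    // drop a port
        .replace(/^www\./, "");                  // drop a leading "www."
}

// Parse "url,category" lines (assumed format) into a domain -> category map.
// The first line is treated as a header; lines without a comma are skipped.
function parseCategoryCsv(csvText: string): Map<string, string> {
    const table = new Map<string, string>();
    const lines = csvText.split(/\r?\n/).slice(1);
    lines.forEach((line) => {
        // Split on the last comma, assuming categories contain no commas.
        const comma = line.lastIndexOf(",");
        if (comma <= 0) return;
        const domain = normalizeDomain(line.slice(0, comma));
        const category = line.slice(comma + 1).trim();
        if (domain === "" || category === "") return;
        // Several URLs can share a domain; keep the first category seen.
        if (!table.has(domain)) table.set(domain, category);
    });
    return table;
}

// Made-up sample rows for illustration only.
const sampleCsv = [
    "url,category",
    "https://www.youtube.com/watch?v=abc,Video",
    "http://github.com:443/some/repo,Technology",
    "not a valid row",
    "https://youtube.com/feed,Entertainment",
].join("\n");
const parsedTable = parseCategoryCsv(sampleCsv);
```

Since the dataset classifies URLs rather than domains, several rows can collapse onto one domain. Keeping the first category seen is the simplest policy; a majority vote per domain would probably be more robust.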

AmazTING avatar Jun 04 '23 14:06 AmazTING

> Only top 100 with classification unfortunately 😞, we'll have to keep looking I guess

For Cloudflare, we could try the API endpoints, specifically this one, which gives details about website categories.

AmazTING avatar Jun 05 '23 07:06 AmazTING