Automate the tracking of Dark Visitors
The idea is to scrape the content of Dark Visitors using a bot and generate PRs for this project. A bit like Dependabot.
@cdransf: please could you add a "help wanted" label to this issue? You never know, someone might automate this for us.
@glyn added! I don't see a feed on the site, so this may be manual (I'm signed up for their emails).
Dark Visitors now offers an API, which may be helpful. It does require signing up for a token.
The API only offers a subset of the list so far; e.g. it does not include uncategorized agents. Fortunately, the site itself does not block any bots at this point.
I managed to add this feature to my own repo with a Python script and commit hooks. It caches the API response in a SQLite database as well.
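Roughly, the shape of what the script does (a simplified sketch with a placeholder endpoint, token variable, and schema, not the actual code):

```python
# Simplified sketch: fetch agent data from the Dark Visitors API and cache the
# response in a local SQLite database. The endpoint path, token variable, and
# "name" field are placeholders, not the real script or schema.
import json
import os
import sqlite3
import urllib.request

API_URL = "https://api.darkvisitors.com/agents"  # placeholder endpoint
DB_PATH = "darkvisitors_cache.db"


def fetch_agents(token: str) -> list[dict]:
    request = urllib.request.Request(
        API_URL, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def cache_agents(agents: list[dict]) -> None:
    connection = sqlite3.connect(DB_PATH)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS agents (name TEXT PRIMARY KEY, payload TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO agents (name, payload) VALUES (?, ?)",
        [(agent["name"], json.dumps(agent)) for agent in agents],
    )
    connection.commit()
    connection.close()


if __name__ == "__main__":
    cache_agents(fetch_agents(os.environ["DARKVISITORS_TOKEN"]))
```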
If such added dependencies are fine with you, I can create a PR to help with this. But I need to understand your preferences in terms of:
- How often, where, and how it should be executed;
- What categories of bots should be disallowed by default;
- Whether it should modify files other than robots.txt;
and any other ideas you might have.
Many thanks for this repo!
@ChenghaoMou that would be great! Perhaps daily, to update robots.json? Pushes containing updates to robots.json will then generate updated versions of table-of-bot-metrics.md and robots.txt.
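For instance, the conversion step could be as small as something like the sketch below (this assumes robots.json maps each agent name to a metadata object with "operator" and "function" fields; adjust to the real schema):

```python
# Sketch of the conversion step: regenerate robots.txt and
# table-of-bot-metrics.md from robots.json. Assumes robots.json maps each
# user-agent name to an object with "operator" and "function" fields.
import json

with open("robots.json") as f:
    agents = json.load(f)

# robots.txt: one User-agent line per bot, all disallowed.
with open("robots.txt", "w") as f:
    for name in agents:
        f.write(f"User-agent: {name}\n")
    f.write("Disallow: /\n")

# table-of-bot-metrics.md: a Markdown table of the same agents.
with open("table-of-bot-metrics.md", "w") as f:
    f.write("| Name | Operator | Function |\n")
    f.write("|------|----------|----------|\n")
    for name, meta in agents.items():
        f.write(f"| {name} | {meta.get('operator', '')} | {meta.get('function', '')} |\n")
```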
A while ago (3 months back) I wrote a little script to generate robots.txt from the Dark Visitors web page, in case it can be helpful :)
https://git.dryusdan.fr/Dryusdan/darkvisitor-useragent-scrapper
The only requirement is to create a .env file in the root of the repository folder with:
darkvisitors_token=<API token not used>
allow_categories=["Archiver","AI Search Crawler","Developer Helper","Fetcher","Search Engine Crawler","Uncategorized"]
custom_deny_agents=["AhrefsSiteAudit","Mail.RU_Bot","Twitterbot"]
allow_categories lists the bot categories we allow (like Google's bot); all other bots will be written to robots.txt as disallowed.
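To illustrate how those settings drive the output, a rough sketch of the idea (not the linked script itself; the scraped_categories mapping below is placeholder data standing in for whatever the scraper collects from the site):

```python
# Rough sketch of the idea behind the .env settings (not the linked script).
# Assumes the .env values end up in the environment; "scraped_categories"
# is placeholder data standing in for what is scraped from the site.
import json
import os

allow_categories = set(json.loads(os.environ.get("allow_categories", "[]")))
custom_deny_agents = json.loads(os.environ.get("custom_deny_agents", "[]"))

scraped_categories = {  # placeholder: category -> agent names from the site
    "AI Data Scraper": ["SomeScraperBot"],
    "Search Engine Crawler": ["Googlebot"],
}

# Everything outside the allowed categories, plus the custom deny list,
# ends up disallowed in robots.txt.
denied = set(custom_deny_agents)
for category, agents in scraped_categories.items():
    if category not in allow_categories:
        denied.update(agents)

with open("robots.txt", "w") as f:
    for agent in sorted(denied):
        f.write(f"User-agent: {agent}\n")
    f.write("Disallow: /\n")
```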
@cdransf Here is the workflow working on my fork: run.
Based on the documentation, the API/token still does not include all the information right now.
Agent types include AI Assistant, AI Data Scraper, AI Search Crawler, and Undocumented AI Agent.
vs. what is listed on the web page:
[ "AI Assistants", "AI Data Scrapers", "AI Search Crawlers", "Archivers", "Developer Helpers", "Fetchers", "Intelligence Gatherers", "Scrapers", "Search Engine Crawlers", "SEO Crawlers", "Uncategorized", "Undocumented AI Agents" ]
If it looks good, I can create a PR shortly.
This looks excellent! Can we scope it to just AI crawlers? E.g.
[
"AI Assistants",
"AI Data Scrapers",
"AI Search Crawlers",
"Undocumented AI Agents"
]
This is great! A couple of follow-up questions:
- can releases also be automated when there's a change, for people who follow the releases web feed?
- ~~maybe the data source should be mentioned in the `README` and/or the FAQ? Just to make it clear that the list is not maintained manually anymore~~ (edit: this is inaccurate, see next comments)
- overengineering idea: offer a GitHub Action that can inject the `robots.txt` from this repo into static websites (rough sketch below)
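The injection part could be as small as this sketch (the raw URL and output directory are placeholders; a real Action would just wrap something like it):

```python
# Sketch of the "inject robots.txt into a static site" idea: fetch the file
# from this repo and drop it into the site's build output. The URL and the
# output directory are placeholders to be replaced with real values.
import pathlib
import urllib.request

ROBOTS_URL = "https://raw.githubusercontent.com/OWNER/REPO/main/robots.txt"  # placeholder
OUTPUT_DIR = pathlib.Path("public")  # placeholder: the static site's build output

with urllib.request.urlopen(ROBOTS_URL) as response:
    (OUTPUT_DIR / "robots.txt").write_bytes(response.read())
```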
> maybe the data source should be mentioned in the `README` and/or the FAQ? Just to make it clear that the list is not maintained manually anymore
Isn't Dark Visitors a subset of our robots.txt? If so, the rest is manually maintained.
Apologies, I didn't properly look at this and probably misunderstood the relationship between the two projects :see_no_evil: Also I just noticed that the README already clearly mentions the data source. Ignore me.