
Automate the tracking of Dark Visitors

glyn opened this issue 1 year ago • 4 comments

The idea is to scrape the content of Dark Visitors using a bot and generate PRs for this project. A bit like dependabot.
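For a sense of what that could look like, here's a minimal sketch of the loop: fetch the agent list, regenerate robots.txt, and open a PR only when something changed. The source URL, branch name, and reliance on the GitHub CLI are all assumptions for illustration, not an agreed design.

```python
# Hedged sketch of the dependabot-style idea: fetch the current agent list,
# regenerate robots.txt, and open a PR only if something changed.
# The URL, branch name, and commit message below are placeholders.
import subprocess
import urllib.request

SOURCE_URL = "https://darkvisitors.com/agents"  # assumed source page
BRANCH = "update-robots-txt"

def fetch_source() -> str:
    with urllib.request.urlopen(SOURCE_URL) as resp:
        return resp.read().decode("utf-8")

def main() -> None:
    page = fetch_source()
    # ...parse `page` and rewrite robots.txt here (omitted)...

    # `git diff --quiet` exits 0 when the file is unchanged, so stop early.
    if subprocess.run(["git", "diff", "--quiet", "robots.txt"]).returncode == 0:
        return

    subprocess.run(["git", "checkout", "-b", BRANCH], check=True)
    subprocess.run(["git", "commit", "-am", "Update robots.txt from Dark Visitors"], check=True)
    subprocess.run(["git", "push", "-u", "origin", BRANCH], check=True)
    # Requires the GitHub CLI (`gh`) to be installed and authenticated.
    subprocess.run(
        ["gh", "pr", "create", "--title", "Update robots.txt",
         "--body", "Automated update from Dark Visitors."],
        check=True,
    )

if __name__ == "__main__":
    main()
```

A scheduled GitHub Actions workflow (or any cron job) could run something like this on whatever cadence the maintainers prefer.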

glyn avatar Mar 29 '24 04:03 glyn

@cdransf: please could you add a "help wanted" label to this issue? You never know, someone might automate this for us.

glyn avatar Mar 29 '24 04:03 glyn

@glyn added! I don't see a feed on the site, so this may be manual (I'm signed up for their emails).

cdransf avatar Apr 01 '24 16:04 cdransf

Dark Visitors now offers an API, which may be helpful. It does require signing up for a token, though.

GlitzSmarter avatar Apr 09 '24 21:04 GlitzSmarter

The API only offers a subset of the list so far; e.g., it does not include the uncategorized agents. Fortunately, the site itself does not block any bots at all at this point.

I managed to add this feature to my own repo with a Python script and commit hooks. It also caches the response in a SQLite database.
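For context, a minimal sketch of that kind of cache, assuming a simple URL → response table with a daily TTL (the URL, schema, and TTL here are illustrative, not the actual script):

```python
# Minimal sketch: cache a Dark Visitors response in SQLite and refresh it
# at most once a day. Endpoint URL and cache policy are assumptions.
import sqlite3
import time
import urllib.request

CACHE_DB = "darkvisitors_cache.db"
CACHE_TTL = 24 * 60 * 60  # one day, in seconds

def fetch_cached(url: str) -> str:
    conn = sqlite3.connect(CACHE_DB)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
    )
    row = conn.execute(
        "SELECT body, fetched_at FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < CACHE_TTL:
        conn.close()
        return row[0]  # still fresh, skip the network call
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, body, fetched_at) VALUES (?, ?, ?)",
        (url, body, time.time()),
    )
    conn.commit()
    conn.close()
    return body

if __name__ == "__main__":
    print(fetch_cached("https://darkvisitors.com/agents")[:200])
```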

If those added dependencies are fine with you, I can create a PR for this, but I need to understand your preferences in terms of:

  1. How often, where, and how it should be executed;
  2. What categories of bots should be disallowed by default;
  3. Whether it should modify files other than robots.txt;

and any other ideas you might have.

Many thanks for this repo!

ChenghaoMou avatar Apr 14 '24 10:04 ChenghaoMou

@ChenghaoMou that would be great! Perhaps daily, to update robots.json? Pushes containing updates to robots.json will then generate updated versions of table-of-bot-metrics.md and robots.txt.
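A rough sketch of that regeneration step, assuming robots.json maps each agent name to a metadata dict (the schema and table columns below are assumptions, not the repo's actual generator):

```python
# Rough sketch: regenerate robots.txt and table-of-bot-metrics.md from
# robots.json. The robots.json schema assumed here (agent name -> metadata
# dict) is for illustration only.
import json

def generate(robots_json_path: str = "robots.json") -> None:
    with open(robots_json_path, encoding="utf-8") as f:
        agents = json.load(f)

    # robots.txt: one User-agent line per bot, then a blanket Disallow.
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    with open("robots.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

    # table-of-bot-metrics.md: a simple Markdown table of the same data.
    rows = [
        "| Name | Operator | Respects robots.txt | Description |",
        "|------|----------|---------------------|-------------|",
    ]
    for name, meta in agents.items():
        rows.append(
            f"| {name} | {meta.get('operator', 'Unknown')} "
            f"| {meta.get('respect', 'Unknown')} "
            f"| {meta.get('description', 'Unknown')} |"
        )
    with open("table-of-bot-metrics.md", "w", encoding="utf-8") as f:
        f.write("\n".join(rows) + "\n")

if __name__ == "__main__":
    generate()
```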

cdransf avatar Aug 06 '24 15:08 cdransf

A while back (3 months ago) I wrote a little script to generate robots.txt from the Dark Visitors web page, in case it's helpful :)

https://git.dryusdan.fr/Dryusdan/darkvisitor-useragent-scrapper

The only requirement is to create a .env file in the root of the repository folder with:

darkvisitors_token=<API token not used>
allow_categories=["Archiver","AI Search Crawler","Developer Helper","Fetcher","Search Engine Crawler","Uncategorized"] 
custom_deny_agents=["AhrefsSiteAudit","Mail.RU_Bot","Twitterbot"]

allow_categories lists the categories of bots we allow (like Googlebot); all other bots will be written to robots.txt.
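In other words, the deny set is everything outside allow_categories, plus custom_deny_agents. A hedged sketch of that logic (variable names and the sample agent list are illustrative, not Dryusdan's actual code):

```python
# Hedged sketch of the allow/deny logic described above: any agent whose
# category is NOT in allow_categories (plus anything in custom_deny_agents)
# ends up with a Disallow rule. The agent list below is sample data.
import json
import os

allow_categories = set(json.loads(os.environ.get(
    "allow_categories", '["Search Engine Crawler"]')))
custom_deny_agents = set(json.loads(os.environ.get(
    "custom_deny_agents", "[]")))

# Example scraped data: (user agent, category) pairs.
scraped_agents = [
    ("GPTBot", "AI Data Scraper"),
    ("Googlebot", "Search Engine Crawler"),
    ("AhrefsSiteAudit", "SEO Crawler"),
]

denied = {
    name for name, category in scraped_agents
    if category not in allow_categories
} | custom_deny_agents

print("\n".join(f"User-agent: {name}" for name in sorted(denied)))
print("Disallow: /")
```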

Dryusdan avatar Aug 06 '24 15:08 Dryusdan

@cdransf Here is the workflow working on my fork: run, code.

Based on the documentation, the API (and its token) still does not expose all of the information right now.

Agent types include AI Assistant, AI Data Scraper, AI Search Crawler, and Undocumented AI Agent.

versus what is listed on the web page:

[ "AI Assistants", "AI Data Scrapers", "AI Search Crawlers", "Archivers", "Developer Helpers", "Fetchers", "Intelligence Gatherers", "Scrapers", "Search Engine Crawlers", "SEO Crawlers", "Uncategorized", "Undocumented AI Agents" ]

If it looks good, I can create a PR shortly.

ChenghaoMou avatar Aug 06 '24 17:08 ChenghaoMou

> @cdransf Here is the workflow working on my fork: run, code. [...] If it looks good, I can create a PR shortly.

This looks excellent! Can we scope it to just AI crawlers? E.g.

[
"AI Assistants",
"AI Data Scrapers",
"AI Search Crawlers",
"Undocumented AI Agents"
]
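If the API route is used, that scoping would presumably be a single field in the request body, using the singular agent-type names the API reportedly expects. A hedged sketch (the endpoint, header, and payload fields reflect my reading of the Dark Visitors docs and should be treated as assumptions):

```python
# Hedged sketch: request a robots.txt scoped to the four AI agent types from
# the Dark Visitors API. Endpoint and payload fields are assumptions based on
# their public docs; DARK_VISITORS_TOKEN is a placeholder env var name.
import json
import os
import urllib.request

AGENT_TYPES = [
    "AI Assistant",
    "AI Data Scraper",
    "AI Search Crawler",
    "Undocumented AI Agent",
]

payload = json.dumps({"agent_types": AGENT_TYPES, "disallow": "/"}).encode("utf-8")
req = urllib.request.Request(
    "https://api.darkvisitors.com/robots-txts",
    data=payload,
    headers={
        "Authorization": f"Bearer {os.environ['DARK_VISITORS_TOKEN']}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```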

cdransf avatar Aug 06 '24 18:08 cdransf

This is great! A couple of follow-up questions:

  • can releases also be automated when there's a change, for people who follow the releases web feed?
  • ~maybe the data source should be mentioned in the README and/or the FAQ? Just to make it clear that the list is not maintained manually anymore~ (edit: this is inaccurate, see next comments)
  • overengineering idea: offer a GitHub Action that can inject the robots.txt from this repo into static websites (rough sketch below)
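For that last point, a minimal sketch of what such an action's script might do: fetch robots.txt from this repo and drop it into a site's build output (the raw URL and output path are assumptions):

```python
# Hedged sketch of the "inject robots.txt into a static site" idea: fetch the
# file from this repo and write it into the site's output directory.
# The raw URL and output path below are assumptions for illustration.
import pathlib
import urllib.request

RAW_URL = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"
OUTPUT_DIR = pathlib.Path("public")  # wherever the static site is built

with urllib.request.urlopen(RAW_URL) as resp:
    robots = resp.read().decode("utf-8")

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
(OUTPUT_DIR / "robots.txt").write_text(robots, encoding="utf-8")
print(f"Wrote {len(robots)} bytes to {OUTPUT_DIR / 'robots.txt'}")
```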

robinmetral avatar Aug 18 '24 04:08 robinmetral

* maybe the data source should be mentioned in the `README` and/or the FAQ? Just to make it clear that the list is not maintained manually anymore

Isn't Dark Visitors a subset of our robots.txt? If so, the rest is manually maintained.

glyn avatar Aug 18 '24 06:08 glyn

Apologies, I didn't properly look at this and probably misunderstood the relationship between the two projects :see_no_evil: Also I just noticed that the README already clearly mentions the data source. Ignore me.

robinmetral avatar Aug 18 '24 12:08 robinmetral