crawl4ai
Please respect TDM Reservation Protocol
I know libertarians will not be happy but ...
In Europe, scraping websites for the purpose of Text and Data Mining and LLM training is legal (this is the good news), unless (this is the bad news) there is a machine-readable signal on the website stating an opt-out from this exception to copyright. This means that if an opt-out is set but a scraper still fetches copyrighted content, the user of the scraper is on the unsafe side of EU law.
There are different ways to express such an opt-out signal: robots.txt is one (with its limitations) and the TDM Reservation Protocol (TDMRep) is another. Whether a machine-readable opt-out is an official standard, and whether it comes from a well-known or an obscure entity, does not matter.
With TDMRep, the opt-out signal can be in a specific file on the web server (similar to robots.txt but specialised), in HTTP responses or in HTML pages.
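To make the HTTP and HTML variants concrete, here is a minimal sketch of how a crawler could detect the signal on a fetched page. It assumes, from my reading of the TDMRep specification, that the HTTP header and the HTML `<meta>` name are both `tdm-reservation` and that a value of `1` means rights are reserved. The function and class names are hypothetical and are not part of crawl4ai. Treating the page as reserved as soon as either source says `1` is a deliberately conservative choice of mine, not a rule taken from the spec.

```python
from html.parser import HTMLParser


class _TDMMetaParser(HTMLParser):
    """Collects the first <meta name="tdm-reservation"> / "tdm-policy" values."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        name = attr.get("name", "").strip().lower()
        if name in ("tdm-reservation", "tdm-policy"):
            # Keep the first occurrence only.
            self.values.setdefault(name, attr.get("content", "").strip())


def tdm_reserved(headers, html=""):
    """Return True if the headers or the HTML signal a TDM opt-out.

    Assumption: the header and meta name are "tdm-reservation", value "1" = opt-out.
    """
    header_value = next(
        (v.strip() for k, v in headers.items() if k.lower() == "tdm-reservation"),
        None,
    )
    parser = _TDMMetaParser()
    parser.feed(html)
    meta_value = parser.values.get("tdm-reservation")
    # Conservative: reserved if either source says "1".
    return "1" in (header_value, meta_value)
```

A crawler could call `tdm_reserved(response_headers, page_html)` after fetching a page and drop the content when it returns `True`.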
TDMRep is now used by many news websites in Europe. Offering TDMRep support as a configurable option would be useful for those users who want to stay on the safe side of EU laws.
NB: TDMRep rules can be checked after the robots.txt filter has been applied, i.e. only for URLs that robots.txt already allows.
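As an illustration of that ordering, here is a rough sketch that applies robots.txt first and only then consults the site-wide TDMRep file. Several things are assumptions on my side: that the file is a JSON array of rules with `location` and `tdm-reservation` keys (as I understand the spec), that the first matching rule wins, and that `location` can be treated as a simple path prefix with an optional trailing `*`. Real pattern matching may be richer than this. All function names are hypothetical.

```python
import json
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Standard robots.txt check using the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def reserved_by_tdmrep(tdmrep_json, url):
    """True if the first rule matching the URL path reserves TDM rights.

    Assumption: simplified prefix matching, first matching rule wins.
    """
    path = urlsplit(url).path or "/"
    for rule in json.loads(tdmrep_json):
        prefix = rule.get("location", "").rstrip("*")
        if path.startswith(prefix):
            return str(rule.get("tdm-reservation")) == "1"
    return False  # no rule matched: no opt-out expressed in this file


def may_mine(url, user_agent, robots_txt, tdmrep_json):
    """robots.txt filter first, TDMRep rules second."""
    if not allowed_by_robots(robots_txt, user_agent, url):
        return False
    return not reserved_by_tdmrep(tdmrep_json, url)
```

With this ordering, a URL disallowed by robots.txt is rejected before the TDMRep rules are consulted, so the TDMRep check only runs on URLs the crawler would otherwise fetch.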