ai-web-impact

Public datasets based on web crawls

Open tarkowski opened this issue 2 months ago • 0 comments

Thank you for a thoughtful and important white paper. I would like to suggest that public datasets based on web crawls, such as Common Crawl, be placed within the scope of the paper. While some AI developers crawl the web directly, others rely on such datasets to train their ML models. These datasets are therefore an important intermediary between web content and AI training. Many of the challenges listed in the paper, and the suggested ways of mitigating them, apply to these datasets and to the organizations that build and publish them. Putting the ethical web principles into practice therefore also requires proper governance of this intermediary stage.

Also, building on issue #26, it would be worthwhile to consider whether such training datasets, as a representation of the web that needs to meet the same ethical requirements, should be governed as a public resource through Web governance mechanisms and institutions.

tarkowski, Apr 20 '24 19:04