[feature-request] Block data that is flagged by a CC "no-AI-training" license from being ingested into a DataLoader

Open GarrettMerz opened this issue 2 months ago • 1 comment

🚀 The feature

Add support for Creative Commons No-AI-Training license flags

Motivation, pitch

Hello! Creative Commons is introducing "preference signal" licenses, an addition to CC licenses that indicates that a contributor does not wish for their data to be used in model training without attribution, or at all (https://github.com/creativecommons/cc-signals). Currently, these signals are indicated in robots.txt and in HTTP headers.

From what I can tell, this mechanism can't be meaningfully enforced at the point of site scraping (a scraper has no indication that the data will later be passed to a model), but I am curious whether the strictest of these signals can be implemented at a technical level at the point of ingestion into the PyTorch DataLoader.

What features would need to be added to ensure that data explicitly flagged as do-not-train cannot be ingested by a model (is this even technically doable)? If it is not doable, would that change if the license information were embedded in EXIF metadata or similar?
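
To make the question concrete, here is a minimal sketch of what a filter at the ingestion point might look like. It is not an existing PyTorch feature. It assumes each sample already carries its license flags as metadata, which is exactly the part that is not guaranteed today. The names `Record`, `LicenseFilteredDataset`, and the flag string `"no-ai-training"` are hypothetical and are not taken from the CC signals proposal or from PyTorch.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Sequence

# Hypothetical flag value; the actual CC signals vocabulary may differ.
NO_TRAIN = "no-ai-training"


@dataclass(frozen=True)
class Record:
    """One sample plus whatever license flags survived collection (assumed)."""
    payload: Any
    license_flags: FrozenSet[str] = frozenset()


class LicenseFilteredDataset:
    """Map-style dataset (__len__/__getitem__) that hides flagged records."""

    def __init__(self, records: Sequence[Record],
                 blocked: FrozenSet[str] = frozenset({NO_TRAIN})):
        # Keep only records whose flags do not intersect the blocked set.
        self._allowed = [r for r in records if not (r.license_flags & blocked)]
        # Track how many records were dropped, e.g. for audit logging.
        self.num_blocked = len(records) - len(self._allowed)

    def __len__(self) -> int:
        return len(self._allowed)

    def __getitem__(self, index: int) -> Any:
        return self._allowed[index].payload


records = [
    Record("sample-a"),
    Record("sample-b", frozenset({NO_TRAIN})),
    Record("sample-c", frozenset({"cc-by"})),
]
dataset = LicenseFilteredDataset(records)
```

Because the wrapper implements `__len__` and `__getitem__`, it should be usable as a map-style dataset with `torch.utils.data.DataLoader`. The filter only works if the flag travels with the sample all the way from collection to training, which is why the EXIF question matters.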

Alternatives

There may be other ways to implement this at other stages within training pipelines.

Additional context

I am not affiliated with Creative Commons! This just seemed like a good discussion to kick off.

Oct 20 '25 15:10 GarrettMerz

Thanks for the question. By data, do you mean actual training data, or code from GitHub that can also be used as data? If it is the former, then I'm afraid PyTorch doesn't have access to or control over it. Not sure I am fully following.

Dec 12 '25 17:12 divyanshk