Unlawful use of my code
The README of this repo states:
StarCoder2 is a family of code generation models (3B, 7B, and 15B), trained on 600+ programming languages from The Stack v2 [...]
The dataset linked contains my code, without following its license (or lack thereof).
Consent is not opt-out. You trained an LLM on code you are not allowed to use.
Stop being a little cry baby
womp womp you can't use my code without following its license
Quote from https://policies.stackoverflow.co/company/trademark-guidance/:
"We decided early on that all user-generated content in the Stack Exchange Network would be given back to the community under a Creative Commons license."
Furthermore, see Point 6 of the ToS Section "Subscriber Content":
You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you as reasonably necessary [...] [...] This means that you cannot revoke permission for Stack Overflow to publish, distribute, store and use such content and to allow others to have derivative rights to publish, distribute, store and use such content. The CC BY-SA 4.0 license terms are explained in further detail by Creative Commons, and the license terms applicable to content are explained in further detail here. You should be aware that all Public Content you contribute is available for public copy and redistribution, and all such Public Content must have appropriate attribution.
this isn't stack overflow lmfao i don't care about their tos
Your code was taken from Stack Overflow, where it was available under the license mentioned above. By posting it on Stack Overflow, you've forfeited all rights to the content you posted. Thus, inclusion in StarCoder2 is lawful.
i have never posted it on stack overflow, show me where exactly i have done so
Where is the code you're referring to and why do you think it's in the dataset?
this issue was created before the opt-out was even a thing, meaning the model was trained on code it wasn't allowed to use (and even then, consent is not opt-out). you can't "untrain" stuff.
https://github.com/bigcode-project/opt-out-v2/issues/1814
as this reply says:
Your opt-out request has been processed and your data was removed in version v2.1 of The Stack and all future versions.
this means that my code is in all previous versions of the dataset, and this repository was created before the opt-out (which wouldn't hold water anyway because consent is not opt-out) was even a thing. stop making this about stack overflow when it wasn't ever mentioned.
Hi, was there any resolution to this?