datacomp

Remove CSAM, if present

Open ahundt opened this issue 1 year ago • 5 comments

A recent report definitively found CSAM in LAION-5B, and that dataset has been taken down until the problem can be solved. The DataComp dataset is much larger. Please let us know what steps you have taken and/or plan to take to address this issue responsibly. Thanks!

https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/

Edit: Ali Alkhatib also makes a good point that, should dataset changes be needed, they might need to be mixed in with additional simultaneous data changes so an old version can't be easily diffed against a new version to find harmful material, among other best practices.

https://x.com/_alialkhatib/status/1737484384914092156?s=46

ahundt avatar Dec 20 '23 15:12 ahundt

Thank you for the suggestion for improving DataComp. The cited study uses one of LAION’s NSFW classifiers to find CSAM content in LAION-5B. Unlike LAION-5B, we removed NSFW content when assembling DataComp, so to the best of our knowledge, the CSAM images in question are not in DataComp. We will review this issue in more depth and welcome specific suggestions for removing content from DataComp. For additional information, please see Section 3.2, Appendix E, and Appendix G of the DataComp paper, which describe our safety measures in more detail.

ludwigschmidt avatar Dec 29 '23 11:12 ludwigschmidt

Thank you for your reply. I appreciate your attention to my concerns. However, I would like to point out that my name already appears in the acknowledgements section on page 10 of your paper, because I previously read and shared several items about the design, construction, collection, and publication approach of this dataset with another member of your team. While those items were noted, to the best of my knowledge they have not been addressed in practice. Addressing them would require actions like those found in the papers I reference below.

Regarding CSAM, the 404 Media article makes explicit the very high risk posed. I would appreciate it if you could substantively address the items in this issue, since I was asking what you’ve done beyond what is outlined in the paper.

Simply multiplying your own error rate figures by the scale of your dataset yields very large numbers of potentially problematic images in your dataset. Multiple papers by Birhane et al., as well as the work of the Stanford group that verified the CSAM in LAION, include substantially more comprehensive evaluation steps that, according to your paper, have not been completed.
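To make the multiplication above concrete, here is a minimal sketch. The rates in it are hypothetical placeholders, not figures taken from the DataComp paper, and the pool size of 12.8 billion image-text pairs is an assumption to be replaced with the actual figure; `expected_missed` is an illustrative helper name.

```python
def expected_missed(dataset_size: int, miss_rate: float) -> float:
    """Expected number of problematic samples remaining after filtering.

    dataset_size: number of samples in the dataset.
    miss_rate: fraction of all samples that are problematic and not caught
               by the filter (hypothetical input, between 0 and 1).
    """
    if dataset_size < 0:
        raise ValueError("dataset_size must be non-negative")
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return dataset_size * miss_rate


# Assumed pool size; substitute the real figure for the dataset in question.
POOL_SIZE = 12_800_000_000

# Hypothetical rates, chosen only to show how the count scales.
for rate in (1e-6, 1e-5, 1e-4):
    count = expected_missed(POOL_SIZE, rate)
    print(f"rate {rate:.0e}: about {count:,.0f} samples")
```

Even a rate that looks negligible per sample corresponds to a large absolute count at this scale, which is the point of the estimate.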

Here is Dr. Birhane’s Google Scholar page with the relevant papers and methods:

  1. Multimodal Datasets
  2. Data-swamps
  3. LAION’s den
  4. Large image datasets

Here is the page with the Stanford group’s work detecting CSAM.

The paper “Stable Bias” is also likely to be relevant: https://arxiv.org/abs/2303.11408

I would appreciate it if this matter were taken seriously and acted upon with care and attention equal to or greater than that of the authors of the papers I’ve provided. The reasons detailed in the 404 Media article make the risks, the motivation for addressing the risks, and the impacts all crystal clear.

Thank you for your time and consideration.

ahundt avatar Jan 20 '24 19:01 ahundt

I need someone from Telegram to help ASAP. There are tons of people and channels posting, selling, and trading CSAM of a friend who was 14 at the time, and emailing Telegram and reporting does nothing. He was lured into sending CSAM videos of himself to a pedophile from Houston, Texas, who then sold the videos and still does, and they are literally all over Telegram. We can't stop these people. We also kept a list of accounts, but there are too many to catch, and we don't know how to stop people from posting these on Telegram.
It's a problem on Instagram, Twitter/X, and Telegram. These videos are everywhere. Please help.

Lwantstostophim avatar Aug 24 '24 00:08 Lwantstostophim

@Lwantstostophim If you’re in the United States, you need to contact the FBI: https://www.fbi.gov/contact-us

If you’re in another country in which it is safe to do so, you should report to the equivalent authorities.

ahundt avatar Aug 27 '24 01:08 ahundt

We tried everything, and now we can't stop the videos from spreading. They are literally being spread 24/7.

Lwantstostophim avatar Aug 27 '24 07:08 Lwantstostophim