imagemonkey-core

New Google dataset

Open dobkeratops opened this issue 7 years ago • 2 comments

https://ai.googleblog.com/2018/09/introducing-inclusive-images-competition.html

What's the license on these images? Would they require visible attribution to copy? (They seem to provide a credits list.) I think they've got bounding boxes and a fair number of labels, but the label lists don't look exhaustive yet. Do they have scope for further annotation?

https://www.kaggle.com/c/inclusive-images-challenge/data

Even if you can't copy them, I guess it's just another label list to try to add to the graph. It should be possible to train the same net on multiple image sets.
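As a rough sketch of what "another label list" plus "the same net on multiple image sets" could look like: merge the label lists into one shared index, and keep a per-dataset mask so a training loss can skip labels that a given dataset never annotates. The label lists and the helper names `build_label_index` and `encode` below are made up for illustration; they are not taken from ImageMonkey or from the Google dataset.

```python
from typing import Dict, List, Sequence, Tuple


def build_label_index(*label_lists: Sequence[str]) -> Dict[str, int]:
    """Map every distinct label (case-insensitive) to one shared index."""
    index: Dict[str, int] = {}
    for labels in label_lists:
        for label in labels:
            key = label.strip().lower()
            if key not in index:
                index[key] = len(index)
    return index


def encode(
    image_labels: Sequence[str],
    dataset_labels: Sequence[str],
    index: Dict[str, int],
) -> Tuple[List[int], List[int]]:
    """Return (target, mask) vectors over the shared label space.

    mask[i] is 1 only for labels the source dataset annotates at all,
    so a loss function can ignore positions the dataset says nothing about.
    """
    target = [0] * len(index)
    mask = [0] * len(index)
    for label in dataset_labels:
        mask[index[label.strip().lower()]] = 1
    for label in image_labels:
        target[index[label.strip().lower()]] = 1
    return target, mask


# Hypothetical label lists, only for illustration.
own_labels = ["dog", "cat", "car"]
other_labels = ["Dog", "Person", "Car", "Tree"]

index = build_label_index(own_labels, other_labels)
target, mask = encode(["Person", "Tree"], other_labels, index)
```

Here an image from the second dataset gets a target over all five merged labels, while the mask leaves out "cat", which that dataset does not label at all. This only handles exact name matches; mapping synonyms or parent/child labels in the graph would need more than this.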

dobkeratops • Sep 09 '18 15:09

Very cool, many thanks for sharing!

Looks like they also made some progress: the last time I checked, they only had a GitHub repo, but now there seems to be a full-blown site (https://storage.googleapis.com/openimages/web/index.html). Very impressive!

Every time one of the big players releases a dataset, I ask myself: "Are we on the right track? Does this even make sense? Should we allow other licenses besides CC0?..."

Don't get me wrong, we made some really great progress this year (I would never have imagined that), but compared to Google with its ~9M images it feels like "nothing".

But what I always keep forgetting is: we can't win against Google (or any other big company) when it comes down to sheer size. No matter how hard we try, it's still a community-driven project where people contribute in their free time. And that's fine. I think there are many other ways we can position ourselves and "win":

  • better tooling (better integration of third-party tools, ML libraries, queryability, ..)
  • truly open source (not company-driven)
  • friendly community
  • some niche labels (I guess no matter how big your dataset is, it's never complete)
  • only public domain images
  • ...

I think "only public domain images" is probably the most controversial point on the list. I have also thought more than once about introducing differently licensed images. But I am a bit afraid that it would get really complicated to both maintain (the software) and use the dataset. I think it would also get pretty complicated for users ("Can I use xy-licensed data in my closed source application?"). In general, content under a CC 4.0 license such as CC BY requires you to attribute the content creator. But what if I use a neural net in my application that was trained on hundreds of images? Do I need to attribute every image creator now?

In order to avoid such legal nightmares, I would prefer the CC0 ("do what the f*ck you want") license. But that's just my personal opinion and, as always, not written in stone ;)

Speaking of public images, in case you missed it: #166. Do you have any objections to that? If not, I would upload the first archive to archive.org in the course of next week.

bbernhard • Sep 09 '18 17:09

"But what if I use a neural net in my application that was trained on hundreds of images? Do I need to attribute every image creator now?"

That's a great point, especially considering that these nets can sometimes regenerate the images they were trained on.

We can focus on making the most of the images we have, I guess (continuing to refine the labels with attributes ..). I know what you mean about the 'drop in the ocean' effect, but the case for keeping independent efforts going still stands. I agree the query/search stuff that's grown around the labelling system is an interesting dimension here.

dobkeratops • Sep 09 '18 17:09