add CC BY 4.0 to terms of use

Open wassname opened this issue 2 years ago • 14 comments

A similar statement is used in Wikipedia's terms of use. It makes clear that user-contributed data can be released under a CC license without dispute. Ideally it's there from the start, so it's good to add it now.

wassname avatar Feb 12 '23 07:02 wassname

This follows a conversation with Huu Nguyen on Discord.

wassname avatar Feb 12 '23 07:02 wassname

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Feb 12 '23 07:02 github-actions[bot]

Why? The Open Assistant data is already CC BY 4.0. https://projects.laion.ai/Open-Assistant/docs/faq#can-i-download-the-data

wannaphong avatar Feb 12 '23 08:02 wannaphong

Sure, that's the intention, but the user hasn't agreed that their contribution is. For that, we can include it in the terms of use. It's what most websites do.

wassname avatar Feb 12 '23 12:02 wassname

The problem here is that you have written CC BY-SA in the ToS but this is not what has been discussed previously. CC BY is not the same as CC BY-SA.

olliestanley avatar Feb 12 '23 12:02 olliestanley

> The problem here is that you have written CC BY-SA in the ToS but this is not what has been discussed previously. CC BY is not the same as CC BY-SA.

Yes, CC BY-SA is not the same as CC BY.

wannaphong avatar Feb 12 '23 13:02 wannaphong

Hey hey! I think it should be CC-BY-4.0 as this is consistent with other LAION datasets. Sorry my bad if I miscommunicated. And thank you for doing this PR!

huu4ontocord avatar Feb 12 '23 13:02 huu4ontocord

Short technical question: For >99% of our users we don't have a real name, only an e-mail address or a discord-id and of course a display name for the website (which is automatically generated during e-mail signup). What counts as "Attribution", i.e. where/how will we list the (currently) >22k users by name? I guess most users would prefer not to have their e-mail address published...

andreaskoepf avatar Feb 12 '23 16:02 andreaskoepf

> Short technical question: For >99% of our users we don't have a real name, only an e-mail address or a discord-id and of course a display name for the website (which is automatically generated during e-mail signup). What counts as "Attribution", i.e. where/how will we list the (currently) >22k users by name? I guess most users would prefer not to have their e-mail address published...

Oh, this is a good point. I had a look at Wikipedia, and they don't allow attribution to Wikipedia itself; you link to the article, where the page history and list of contributors can be found. So this may not be viable. If you're going to have attribution, then you have to hold people's information perpetually. This will have GDPR implications: you'll have to honour "right to removal / be forgotten" requests as a data controller for as long as you're in control of the data. Hugging Face, as a data processor, will only be forced to do this if LAION doesn't or can't. In effect, it's promising "I'll react to volumes of frivolous deletion requests, within 28 days, or face hefty fines".

A more open license that doesn't require attribution would be preferable IMO. A few options:

  • Users giving LAION the data under CC0 and the data being released the same way seems the fairest way to do it.
  • Users giving the right for LAION to redistribute under a CC-BY with attribution to them. This seems a bit crappy - users have to attribute LAION, but they don't get the same back.
  • The worst would be a generic copyright grant (web2.0 data highwayman approach) where the data could be closed up at any time. Even if data was released under CC0 it feels uneven.

bitplane avatar Feb 12 '23 18:02 bitplane

Right, I get the point now. Yeah it makes sense to make the output unencumbered by share alike. CC BY-SA is problematic and as bitplane pointed out even CC-BY can be burdensome.

We could go CC-BY and just include our best-effort list of all user display names, and put it on the website and in the dataset. Perhaps even saying that if you don't provide your name, you waive the right to attribution.

The MakeHuman project uses CC0 for its output, so perhaps that would be the best. In that case I should change the FAQ too.

What do people think?

wassname avatar Feb 12 '23 22:02 wassname

I think you should look at the Common Voice project. Common Voice uses CC-0, and it's one of the best projects for speech datasets.

I'm not sure about CC-0 for a text corpus, but I think that if the corpus can be CC-0, it will be the best kind of corpus.

wannaphong avatar Feb 13 '23 03:02 wannaphong

Changing the license now would be problematic, since all past contributors would need to be asked for permission. Granted, it's also not very clear to contributors that the current license is CC BY, so we may be in a pickle regardless.

hecko-yes avatar Feb 13 '23 06:02 hecko-yes

Personally I'm convinced that CC-0 is better, but if we can't get consensus we should merge this CC BY change right now. It's noncontroversial, puts us on a better legal footing, and can be changed to CC-0 later.

Then we can make a new issue to debate CC-0, and change it if we get consensus.

So reviewers, let's merge it!

wassname avatar Feb 23 '23 04:02 wassname

Sorry, I thought I had done that. Now it's CC BY 4.0.

wassname avatar Feb 23 '23 07:02 wassname

I think we effectively ask users to provide inputs under CC-0. Maybe something like the following should be added to the terms to make this clearer:

"If the user's input constitutes a work protected by copyright, the user grants LAION a simple, temporally, spatially and factually unrestricted right to use the input. In particular, LAION is authorized to use the user's inputs for the development and improvement of large language models."

andreaskoepf avatar Jun 08 '23 19:06 andreaskoepf