sssom
What happens to licenses on merge?
Let's say we have two mapping sets: M1 and M2. M1 is published under CC-0, and M2 is published under CC-BY.
What should happen when we merge M1 and M2 to M3 with our SSSOM python toolkit?
- Option 1: M3 license is unspecified, but cannot be less restrictive than CC-BY
- Option 2: M3 license becomes CC-BY by default but can be changed to something that is compatible with CC-BY
- Option 3: M3 can be any license (the licenses of M1 and M2 mean nothing). As long as the conditions for CC-BY are met, M3 can be CC-0 or whatever the publisher wants it to be.
- Option 4: M3 can be any license, but the individual term mappings from M1 get individual license metadata CC-0, and individual mappings from M2 will all get license metadata CC-BY.
- Option 5: We close our eyes and do what we all do in open data, and hope no one sues anyone.
Note that in the ontology world, we use Option 5.
@matentzn IANAL (blah blah blah), but maybe I can give a little perspective from what the RDP has done.
You are correct that Option 5 is what's done in a lot of cases. Essentially, it is an attempt to offload the responsibilities of data integration onto somebody else. This is easy and has a lot of precedent, but is not satisfying in a lot of ways. Let's maybe explore the space where that is not on the table for a little bit.
As a starter, I want to highlight three things: 1) that CC0 is not a license, but a tool to attempt to dedicate a work into the public domain (or create something that works the same way); 2) that license terms are immutable (without permission/a new license from the rights holder); 3) a work without any information to the contrary is considered to be under standard copyright protection in the US, and anything you do outside of that is at your own risk.
To clarify what is being asked here: what is the nature of M3, and what might "mapping sets" and "merge" mean here? I'm taking this to just mean the data. There is no clear rule for what is "fair use" versus "infringing use" versus "correct use" outside of negotiations or a courtroom, so it's sometimes easier to think about the data overall in the beginning.
Is this to mean that M1 is CC0, M2 is CC BY, and then the Options are using the terms for M3 as a variable?
what is the nature of M3, and what might "mapping sets" and "merge" mean here?
Imagine two mapping sets (literally a set of ontology mappings), M1 and M2:
M1:
mapping_set_id: M1
license: CC-0
| subject | predicate | object |
|---|---|---|
| A | equivalentTo | C |
| B | equivalentTo | D |
M2:
mapping_set_id: M2
license: CC-BY
| subject | predicate | object |
|---|---|---|
| V | equivalentTo | X |
| W | equivalentTo | Y |
M1 is published under CC-0 (public domain) and M2 is published under CC-BY.
Now we have a process MERGE that combines the two to M3:
M3:
mapping_set_id: M3
license: ???
| subject | predicate | object |
|---|---|---|
| A | equivalentTo | C |
| B | equivalentTo | D |
| V | equivalentTo | X |
| W | equivalentTo | Y |
Let's assume for now that the creation of M3 (a derivative?) was itself correct use according to the licenses (it's probably not, but let's set that aside for now).
Concretely, I would like to know what our official merge tool should do with the original license metadata. Include it? Drop it? In the ontology world, we usually drop it. But you could include the license constraints as some kind of column:
| subject | predicate | object | license |
|---|---|---|---|
| A | equivalentTo | C | CC-0 |
| B | equivalentTo | D | CC-0 |
| V | equivalentTo | X | CC-BY |
| W | equivalentTo | Y | CC-BY |
or you could make a comment:
mapping_set_id: M3
license: CC-0
license_comment: This is a derived product. Original licenses apply.
derived_from:
- M1
- M2
| subject | predicate | object |
|---|---|---|
| A | equivalentTo | C |
| B | equivalentTo | D |
| V | equivalentTo | X |
| W | equivalentTo | Y |
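The per-mapping license column from the first variant above could be produced along these lines. This is only a minimal sketch, assuming mapping sets are plain Python dicts; the function name `merge_with_row_licenses` and the dict layout are hypothetical and are not part of the SSSOM toolkit.

```python
def merge_with_row_licenses(*mapping_sets):
    """Concatenate mapping rows, stamping each row with its source set's license."""
    merged_rows = []
    for mapping_set in mapping_sets:
        for row in mapping_set["mappings"]:
            # Copy the row so the source mapping set is left untouched.
            merged_rows.append({**row, "license": mapping_set["license"]})
    return merged_rows


# Hypothetical in-memory versions of M1 and M2 from the example above.
m1 = {
    "mapping_set_id": "M1",
    "license": "CC-0",
    "mappings": [
        {"subject": "A", "predicate": "equivalentTo", "object": "C"},
        {"subject": "B", "predicate": "equivalentTo", "object": "D"},
    ],
}
m2 = {
    "mapping_set_id": "M2",
    "license": "CC-BY",
    "mappings": [
        {"subject": "V", "predicate": "equivalentTo", "object": "X"},
        {"subject": "W", "predicate": "equivalentTo", "object": "Y"},
    ],
}

m3_rows = merge_with_row_licenses(m1, m2)
for row in m3_rows:
    print(row["subject"], row["predicate"], row["object"], row["license"])
```

Keeping the license on each row means the set-level `license` of M3 can be decided separately without losing track of which rows came from the CC-BY source.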
I know there won't be any 100% answers, but I have no idea at all what we should do here. I will be lobbying and pressing for CC-BY (or CC-0) across the board, so that things do not get more complex than CC-BY vs. no license.
@matentzn Yeah, uniform and sensible licensing would make so many things so much easier. I think this case is pretty easy though. Forgive the verbiage.
Okay, so it looks like, for the sake of simplicity, you have M1 as CC0 and M2 as CC BY (4.0); M3 is a combination of the two. My best guess, as a hopefully reasonable lay person, would be like this:
In the US, and hopefully most reasonable places, M1's CC0 can essentially be read as public domain, so you can do whatever you want with it. M2's copyrights are held by a copyright holder; let's call them CH2. CH2 has granted you a certain set of abilities to work with M2, including creating derivative works, assuming that you adhere to the terms of CC BY 4.0. Great. So you create a new work M3, that is M1 + M2. What is the best practice?
CH2 does not hold copyright over this entire work, just portions. The new attribution could essentially be "derivative of M2 with added public domain data". I could also imagine that for complicated derivative works, one could have "OBO Foundry" be a new rights holder that licenses the work under CC BY 4.0, so that the attribution would be like "portions of this file are derivative of M2 and OBO Foundry, CC BY 4.0".
Following your example above, the derivative has portions that are CC BY 4.0 no matter what, as you have no other clear avenue from CH2 (i.e. it cannot overall be public domain). It might be most clear to have the entire derived work under CC BY 4.0 then.
mapping_set_id: M3
license: CC-BY-4.0
license_comment: This is a derived product, including M1, dedicated to the public domain. M2 title, author, source, license (TASL).
derived_from:
- M1
- M2
A simple CC exercise (and some additional info) on this can be found here: https://wiki.creativecommons.org/wiki/Best_practices_for_attribution
Does this help clear anything up?
Looking at your original five Options, I think we need an Option 6: M3 is a derivative work of a CC BY work (M2) and public domain data (M1); the new work may be CC BY as long as the attribution information is kept for M2.
For the sake of cleanliness, I'd still note that the additional public domain data was the CC0-declared M1.
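As a rough illustration of Option 6, a merge tool could pick the set-level license with a rule like the one below. This is a sketch under stated assumptions: it only knows about CC0 and CC BY 4.0 (the two cases discussed here), it assumes that a merge of CC0-only inputs may stay CC0, and the function name `merged_license` and the identifier strings are hypothetical. It is not legal advice and not existing SSSOM toolkit behaviour.

```python
PUBLIC_DOMAIN = "CC-0"
ATTRIBUTION = "CC-BY-4.0"


def merged_license(source_licenses):
    """Suggest a set-level license for a merge of CC0 and CC BY 4.0 sources."""
    if not source_licenses:
        raise ValueError("at least one source license is required")
    unknown = set(source_licenses) - {PUBLIC_DOMAIN, ATTRIBUTION}
    if unknown:
        # Anything else needs a human decision; refuse rather than guess.
        raise ValueError(f"no rule for licenses: {sorted(unknown)}")
    if ATTRIBUTION in source_licenses:
        # CC BY portions keep their terms, so the whole set is marked CC BY.
        return ATTRIBUTION
    # Assumption: only public domain inputs, so the result can stay CC0.
    return PUBLIC_DOMAIN


print(merged_license(["CC-0", "CC-BY-4.0"]))
```

Raising an error for any other license is a deliberate choice in this sketch: the thread only reasons about CC0 and CC BY 4.0, so nothing else is decided automatically.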
Yes, this is very helpful. Am I correct in assuming that someone deriving M4 (sic) from M3 needs to preserve the attribution to M2?
Instead of license_comment, can you propose something we can do to "attribute" that is a bit more structured? Something like this, hopefully as minimal as humanly possible:
derived_from:
  - mapping_set_id: M1
    license: CC-0
    creator_id: ORCID:123
  - mapping_set_id: M2
    license: CC-BY
    # (no further attribution required)
Would this satisfy your sense of CC-BY spirit?
@matentzn Yes, I believe that is a correct interpretation: copyright and license are immutable unless new terms are negotiated--no matter how many derivatives are produced, they preserve rights of all the originating sources.
"Spirit" and what is reasonable can be hard to define. For myself, I don't have a feeling how this all hangs together and what the overall context is, so I'm grasping at straws a bit. It might be good to make sure that whatever you come up with is flexible enough to be revisited in the future. I'm assuming that the proposed derived_from field above is embedded in M3 itself?
As a pedantic (but I think important) note, CC0 is not a license, but (in the US) a declaration tool for the public domain--there are no terms to be adhered to; rather, it is a public record that the author of the work is saying they are giving it to all to use however they want, without license. I think it is nice (and a good idea) to note that for one's records, but one could also leave it out. (BTW, did you possibly flip M1 and M2 in your example above? I'm not sure why CC0 would have more metadata than CC BY 4.0.)
Looking at the best practices as an example from the above: Title, Author, Source, and License. creator_id and mapping_set_id feel like they may cover T and A, but if not something more explicit could be added. license is L. S may in fact be the mapping_set_id as well, but I think that a link/URI might be better. (Doubly so as we get into the next bit.)
Looking at the actual wording of how attribution must be specified as defined in the text of the license (https://creativecommons.org/licenses/by/4.0/legalcode):
A. retain the following if it is supplied by the Licensor with the Licensed Material:
i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of warranties;
v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
As long as the derived_from is always additive, I think that would keep a pretty good record of the changes in an unambiguous way.
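What "always additive" could mean in practice is sketched below: when a new set is derived from sets that already carry a `derived_from` list, every earlier entry is carried forward and the direct sources are appended. This is a minimal sketch assuming plain dicts; the helper names `provenance_entry` and `additive_derived_from`, the extra set `M5`, and the `ORCID:123` placeholder are hypothetical.

```python
def provenance_entry(mapping_set):
    """Reduce a mapping set to the fields kept for attribution."""
    entry = {
        "mapping_set_id": mapping_set["mapping_set_id"],
        "license": mapping_set["license"],
    }
    if "creator_id" in mapping_set:
        entry["creator_id"] = mapping_set["creator_id"]
    return entry


def additive_derived_from(sources):
    """Collect the sources' own ancestors plus the sources themselves."""
    seen = {}  # keyed by mapping_set_id; dicts keep insertion order
    for source in sources:
        # Ancestors first, then the direct source; nothing is ever dropped.
        for entry in source.get("derived_from", []) + [provenance_entry(source)]:
            seen.setdefault(entry["mapping_set_id"], entry)
    return list(seen.values())


m3 = {
    "mapping_set_id": "M3",
    "license": "CC-BY-4.0",
    "derived_from": [
        {"mapping_set_id": "M1", "license": "CC-0"},
        {"mapping_set_id": "M2", "license": "CC-BY-4.0", "creator_id": "ORCID:123"},
    ],
}
m5 = {"mapping_set_id": "M5", "license": "CC-0"}

m4_derived_from = additive_derived_from([m3, m5])
print([entry["mapping_set_id"] for entry in m4_derived_from])
```

With this shape, a set M4 derived from M3 still lists M2 and its license, which is the attribution that needs to survive further derivation.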
A good summary of requirements for 4.0 (which is materially different than previous versions) can be found from CC here: https://wiki.creativecommons.org/wiki/License_Versions#Detailed_attribution_comparison_chart . I feel like what you have covers a lot of that. I wonder about things like license files (e.g. LICENSE.md) and how to concat them, but that may not apply to your use case.
I'm assuming that the proposed derived_from field above is embedded in M3 itself?
Yes.
Wow, ok.. We are using these licenses all day every day and no one I know has ever read them.. The legal code on CC-BY is pretty clear I would say, but it seems really inconvenient to have to include a "copyright notice" separate from the reference to the CC-BY license itself. Do you understand what exactly that would mean for this example?
- retain the following if it is supplied by the Licensor with the Licensed Material:
- [X] identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
- [ ] a copyright notice: what would that look like?
- [X] a notice that refers to this Public License: the CC-BY link
- [ ] a notice that refers to the disclaimer of warranties: what would that look like?
- [X] a URI or hyperlink to the Licensed Material to the extent reasonably practicable: the mapping_set_id should be a purl that resolves to its location
I know no one that does this:
- indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
Should we?
"We are using these licenses all day every day and no one I know has ever read them.." Yeah, that's pretty much what the (Re)usable Data Project was all about: raising awareness and trying to lay down a framework to have the tough conversations, like this one :) I also read and tried to map out the terms of like a hundred different licenses, so something as clear and well annotated as the CC licenses always makes me smile.
Again, this gets into me not being a lawyer and this not being legal advice, but I think context, and what would be considered reasonable for that context, is worth looking at. From my understanding, these are mapping files with embedded metadata that are intended to be used as such and did not have a separate license file to begin with. I think that you could probably just continue to follow that pattern if it seems like that's what the authors originally intended. For your unchecked items:
a copyright notice: what would that look like?
Well, in the metadata you are already letting people know that the material is copyrighted under a CC license by the authors, so that seems like most of it. I think that if you end up having a LICENSE file around or other copyright notices that you'd include them, if necessary. I'd assume that could be automatically generated from the metadata.
a notice that refers to the disclaimer of warranties: what would that look like?
From the CC Best Practices wiki page:
"Lastly, is there anything else I should know before I use it?
When you accessed the material originally did it come with any copyright notices; a notice that refers to the disclaimer of warranties; or a notice of previous modifications? (That was a mouthful!) Because that kind of legal mumbo jumbo is actually pretty important to potential users of the material. So best practice is to just retain all of that stuff by copying and pasting such notices into your attribution. Don't make it anymore complicated than it is -- just pass on any info you think is important."
Implied there is that if it isn't there, you don't need it--it's all about respecting the wishes of the authors. I would also point out that there is a disclaimer of warranties in the license itself as well (Section 5), so unless the original author needed/wanted something more, that may suffice.
indicate if You modified the Licensed Material and retain an indication of any previous modifications
Aren't you already doing this with the derived_from field? You capture both the fact that it was modified and what the data streams are. I don't think this is asking for a diff (see the Best Practices examples from before).
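If a separate notice or LICENSE-style text is ever wanted, it could plausibly be generated from the embedded metadata. The sketch below assumes a `derived_from` list of plain dicts with `mapping_set_id`, `license`, and an optional `creator_id`; the function name `attribution_notice`, the wording of the notice, and the `ORCID:123` placeholder are all hypothetical. The two license URLs are the public Creative Commons pages for CC0 1.0 and CC BY 4.0.

```python
LICENSE_URLS = {
    "CC-0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
}


def attribution_notice(mapping_set):
    """Render a plain-text attribution notice from derived_from metadata."""
    set_id = mapping_set["mapping_set_id"]
    # Stating that the set is a modified combination covers the
    # "indicate if You modified the Licensed Material" point.
    lines = [f"{set_id} is a modified combination of the following sources:"]
    for source in mapping_set.get("derived_from", []):
        creator = source.get("creator_id", "creator not stated")
        # Fall back to the raw license string if no URL is known.
        license_ref = LICENSE_URLS.get(source["license"], source["license"])
        lines.append(
            f"- {source['mapping_set_id']} ({creator}), "
            f"{source['license']}: {license_ref}"
        )
    return "\n".join(lines)


m3 = {
    "mapping_set_id": "M3",
    "license": "CC-BY-4.0",
    "derived_from": [
        {"mapping_set_id": "M1", "license": "CC-0"},
        {"mapping_set_id": "M2", "license": "CC-BY-4.0", "creator_id": "ORCID:123"},
    ],
}

print(attribution_notice(m3))
```

Because the notice is derived entirely from the metadata, the mapping file stays the single source of truth and the text can be regenerated whenever `derived_from` grows.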
To take a little step up from the details, the point of all these licenses is to understand and then respect the wishes of the authors/copyright holders regarding their data, within the context in which all this is taking place. I think that if you take all this into account, work with clear CC licenses, and go along with their terms, you are unlikely to run into issues, as this is exactly the kind of thing that the authors wanted for their work in the first place: credit, and passing it along to the next person with the same rights.
Ok. This is very helpful. I am assuming that as long as there is no metadata element that captures additional stuff like warranties and anything beyond cc on the copyright issue, then I don't need to worry about copying it. We simply restrict the potential of data providers to express themselves to a simple reference to a license, and that we will preserve. If users in the future need to express more than that, they will first have to request a suitable metadata element, and then we can refer back to this issue here.
I know that restricting the metadata is, legally, probably not enough. If someone were to publish a mapping file with a LICENSE.md, that would probably overrule all of what I posit here, but the idea should be that the data files are self-contained, so no information relevant to their usage should live outside of them.
Thank you very much @kltm!
@matentzn That's pretty much my lay read of it. Of course, this would only be for talking about CC BY 4.0 and CC0. Trouble could happen if people do things like say "refer to the license in this directory" and the like, but as long as the license is just a URL, it makes keeping the reference easy.