mne-python icon indicating copy to clipboard operation
mne-python copied to clipboard

Simplify license headers in source files

Open cbrnr opened this issue 2 years ago • 74 comments

Currently, our source files show individual authors as well as the license in a comment near the top, for example:

# Authors: Jaakko Leppakangas <[email protected]>
#          Robert Luke <[email protected]>
#
# License: BSD-3-Clause

I think this is not ideal for the following reasons:

  1. It is not trivial to keep the list of authors up to date. Since this information is available from Git anyway, it is also not necessary.
  2. I don't think we have always updated these lists, so currently this information is likely outdated.
  3. There's always the question what kind of change(s) qualifies an author to get added to that list.

I suggest that we replace our headers with the following license text across all source files:

# Authors: MNE-Python development team
# License: BSD-3-Clause

The exact wording can be changed of course (feel free to make suggestions), but the point is to get rid of individual authors.

cbrnr avatar Mar 28 '23 07:03 cbrnr

This was also brought up by @drammock in https://github.com/mne-tools/mne-python/pull/9569#issuecomment-879969307 some time ago.

I generally agree. Do we need a copyright notice in each file? Isn't the license in the root of the repository enough, to cover it all?

sappelhoff avatar Mar 28 '23 08:03 sappelhoff

Removing is even better. I don't think every file needs to have the license when there is a license in the root.

cbrnr avatar Mar 28 '23 09:03 cbrnr

I don't think every file needs to have the license when there is a license in the root.

I was originally on board with this until I read this stackexchange thread where people sometimes (usually?) generally recommend that you do include the license so that a copied-away file still makes it clear what the license is. So I'm +0 on removing the BSD 3-clause line, I'm fine with keeping it or removing it.

Last time the "authors" stuff came up, IIRC (I can't find the thread, maybe it was in person at a dev meeting!) most people thought "yes these should go away" until @agramfort pointed out that it's a nice way for newer contributors to get some credit for what they do. I think we conveged on getting rid of the authors in mne/ but keeping those lines in examples/ and tutorials/ because -- even though they, too, end up out of date -- it's nice for newer and/or less experienced people to have their name visible on something that people are more likely to read (i.e., an example or tutorial) than our whats_new.rst.

larsoner avatar Mar 28 '23 13:03 larsoner

Good points @larsoner. Then I'm also +0 on removing the license line.

I'm also OK with removing authors only in mne/. If we keep them in examples/ and tutorials/, we should define what kind of changes warrant adding a new author. Are we OK with a typo fix? Or does it have to be more substantial? This discussion could be avoided if we remove authors in all sources, which I'd still prefer. We now highlight new authors in our changelog, which I believe we didn't do when @agramfort made this comment.

cbrnr avatar Mar 28 '23 13:03 cbrnr

We now highlight new authors in our changelog, which I believe we didn't do when @agramfort made this comment.

Looking back, looks like it was actually @rob-luke who advocated for keeping the examples/tutorial authors here in the same thread linked to before in July 2021 (I had just missed it). And back then we already were highlighting new contributors in the release April 2021 (and even if we didn't, I think the changelog is a lot less visible to end-users in general than examples/tutorials).

we should define what kind of changes warrant adding a new author. Are we OK with a typo fix? Or does it have to be more substantial?

I think people will generally be self-policing here and add their name if they think they did something worthwhile. I think we can take it case-by-case if something concerning does come up rather than trying to codify a policy here.

larsoner avatar Mar 28 '23 13:03 larsoner

I think the changelog is a lot less visible to end-users in general than examples/tutorials

this is the point (originally made by @rob-luke as @larsoner mentioned) that convinced me to keep authors on tutorials/examples, even though I still dislike the fact that they can get stale/out of date/inaccurate.

I like the idea of removing authors from the source .py files within mne/ though. Anyone who is reading the source code I think it's fair to assume they at least know about changelogs, and many will know about git blame too. Plus GitHub shows avatars of all authors if you view the source on GitHub, so that's another way authors are visible for source files.

I'm +0.5 to keep the license lines.

drammock avatar Mar 28 '23 14:03 drammock

my take on this and why I think it's valuable to keep names:

  • having your name on a file gives a sense of code ownership and responsibility.
  • names on top of files survives github and where the code is stored.
  • I've been contacted by people to ask questions in sklearn code as they found my name and email on the files.
  • for some contributors it's important to them to have their name. There can be a sense of pride to have your name there and can maybe encourage them to contribute more.

my 2c

Message ID: @.***>

agramfort avatar Mar 28 '23 16:03 agramfort

I've been contacted by people to ask questions in sklearn code as they found my name and email on the files.

I have too, but this is one reason I think it's bad to have the names there, because:

  1. in general I think we want to discourage users from sending emails directly to specific devs --- questions/support requests should go through our preferred channels (discourse / GitHub) so that currently active maintainers will see the message.
  2. Assuming names are in chronological order, if folks email the first name on the list they are in fact more likely to get someone who hasn't touched that code in a while.

having your name on a file gives a sense of code ownership and responsibility

Is this your personal feeling, or have contributors told you that they feel this way? (how many?) If we want to cultivate that feeling of responsibility, we could enforce / strongly encourage using the GitHub codeowner feature

for some contributors it's important to them to have their name. There can be a sense of pride to have your name there and can maybe encourage them to contribute more.

is the sense of pride from seeing your name in the file itself any greater than seeing your name on the associated commit(s)? I feel like we don't really know whether people are adding their names because they feel pride, or because they saw other names and assumed they were supposed to do it.

drammock avatar Mar 28 '23 17:03 drammock

names on top of files survives github and where the code is stored.

I find this to be the most important reason to keep names ... when files are re-organized "git blame" stops working. E.g.: https://github.com/mne-tools/mne-python/blob/main/mne/report/report.py ... I don't see my name in the git commits

image

even though I might be able to answer questions related to the MNE-report better than the average contributor. For functionality that is old, this may be even more critical as it might be hard to dig up who originally wrote that part of the code.

jasmainak avatar Mar 28 '23 19:03 jasmainak

ok, I withdraw my objection to names in the source files then. It's weird that you aren't listed by GitHub as a contrib @jasmainak; you show up in the blame of that file:

Screenshot 2023-03-28 at 15-12-04 mne-python_mne_report_report py at main · mne-tools_mne-python

drammock avatar Mar 28 '23 20:03 drammock

So summarizing what I think is an emerging consensus (?):

It is not trivial to keep the list of authors up to date. Since this information is available from Git anyway, it is also not necessary.

There are social reasons (pride, fame/publicity, guilt/ownership) to keep them there.

I don't think we have always updated these lists, so currently this information is likely outdated.

The harm caused by outdated information is minimal / consists of inactive devs getting unwanted emails, and/or users having their emails go unanswered. I still think this could be avoided; maybe we include names but not emails?

There's always the question what kind of change(s) qualifies an author to get added to that list.

@larsoner suggests that the honor system / letting people decide for themselves is good enough here. I won't quibble with that.

So in the end maybe there is nothing to be done here @cbrnr ? Unless you want to push for removing just the emails and keeping the names... but IDK if that will be any less contentious.

drammock avatar Mar 28 '23 20:03 drammock

I still think names in the source code only make sense if they are kept up to date. I rarely bother to put my name there, but now this makes it look as though I did not contribute. I can imagine that this is the case for others as well.

@drammock made some good points: people should not be looking at the source to find developers and ask questions. They should use our official channels. If these are not obvious, we can make them more obvisous in our documentation.

The sense of ownership and responsibility can be instilled by other means, and I think we should try to find something that works automatically. The spirit of open source is not that someone "owns" a piece of code, but we as a community have a shared responsibility for everything we create in MNE-Python. Putting some names in the sources seems to be biased and unfair unless we keep everything up to date all the time (which is not feasible).

Even if some previous contributor knows most about a specific topic I trust current developers to either answer the question or forward it to someone who they think will be able to help.

So I still think that removing names (but keeping the license) in mne/ (but keeping them in examples/ and tutorials/) is the better option.

cbrnr avatar Mar 29 '23 05:03 cbrnr

@cbrnr i would say add your name to the files you feel a clear ownership for. I don't want your name in the files then don't.

agramfort avatar Mar 29 '23 06:03 agramfort

i would say add your name to the files you feel a clear ownership for. I don't want your name in the files then don't.

It's not about me. It's that I don't like it in general for the reasons I mentioned.

So this is settled? No chance that something can be done here?

cbrnr avatar Mar 29 '23 07:03 cbrnr

Regarding the comment by @jasmainak, you show up in the blame of the file you linked to, so git blame does not stop working when files are re-organized. GitHub shows the most recent contributer per line. It's the list that's buggy, we should probably report that to GitHub. Because that file was renamed, you need to go to the previous file, which is linked in the list of commits.

But this file demonstrates the issue I have: @hoechenberger and @larsoner have contributed a lot, but they are not mentioned at the top of the file. Whether or not that's intentional is also beside the point, because if there is a list this indicates to others that these are the developers who contributed most and are responsible for that file. Which is simply not true.

cbrnr avatar Mar 29 '23 07:03 cbrnr

Here's a shell command that prints out all contributors of a file:

git --no-pager log --follow --pretty=format:"%aN <%ae>" mne/report/report.py | sort | uniq

The important bit is the --follow option, which makes it work across renames.

If you want to preserve the order of contributions:

git --no-pager log --follow --pretty=format:"%aN <%ae>" mne/report/report.py | nl | sort -uk2 | sort -nk1 | cut -f2-
Dimitri Papadopoulos Orfanos <[email protected]>
Eric Larson <[email protected]>
Daniel McCloy <[email protected]>
Mathieu Scheltienne <[email protected]>
Clemens Brunner <[email protected]>
Mathieu Scheltienne <[email protected]>
Richard Höchenberger <[email protected]>
Alex Rockhill <[email protected]>
Jeff Stout <[email protected]>
Marijn van Vliet <[email protected]>
Alexandre Gramfort <[email protected]>
Stefan Appelhoff <[email protected]>
Rob Luke <[email protected]>
Valerii Chirkov <[email protected]>
Martin Schulz <[email protected]>
Guillaume Favelier <[email protected]>
Dmitrii Altukhov <[email protected]>
Alex Rockhill <[email protected]>
Daniel McCloy <[email protected]>
Joan Massich <[email protected]>
Jaakko Leppakangas <[email protected]>
Mainak Jas <[email protected]>
Denis A. Engemann <[email protected]>
Teon Brooks <[email protected]>
Teon Brooks <[email protected]>
Jean-Remi King <[email protected]>
Mainak Jas <[email protected]>
Denis A. Engemann <[email protected]>

cbrnr avatar Mar 29 '23 08:03 cbrnr

But this file demonstrates the issue I have: @hoechenberger and @larsoner have contributed a lot, but they are not mentioned at the top of the file. Whether or not that's intentional is also beside the point,

I'm okay with having the names of the original authors of a module / function listed on top. I don't care if I'm listed even if I contributed a lot, for as long as the commit history is preserved. Should we at some point be forced to kill the history, we should add all contributors to the top.

Makes me wonder about the planned transition to Black though, where we'll touch many, many lines of the code. I'm wondering if we could associate this gigantic commit with a "virtual" contributor / bot (instead of a real developer account). But this just as an aside.

hoechenberger avatar Mar 29 '23 08:03 hoechenberger

I'm okay with having the names of the original authors of a module / function listed on top. I don't care if I'm listed even if I contributed a lot, for as long as the commit history is preserved. Should we at some point be forced to kill the history, we should add all contributors to the top.

But some (new) contributors might care, but not dare to ask to include their name because they might think that only original contributors or "worthy" contributions should go in this list.

That's exactly my suggestion, either we copy the full list to the header, or none. Everything in between is not fair.

Coming back to a point I thought we'd converged on (also in a previous discussion): I would leave authors in tutorials and examples, but not in sources. This was the consensus in the linked discussion in my opinion.

Re Black, it is possible to ignore revisions - is that not sufficient? GitHub does support it (https://docs.github.com/en/repositories/working-with-files/using-files/viewing-a-file#ignore-commits-in-the-blame-view).

cbrnr avatar Mar 29 '23 09:03 cbrnr

I agree with @agramfort here: it can be valuable for some, especially junior, developers to have their name in the source files. In my opinion, this should be marked by "substantial contribution".

The reason why I think this is valuable is the following: having a name on a source file is an easy way to showcase that one has substantially contributed to a specific method etc. Recruitment committees of various disciplines will probably not use a command line tool or other git/GitHub functionality to figure out contributions. However, adding a link to a file where your name is up to is easy to do. Furthermore, people who look at the source code locally will make a connection with the names - thus increasing visibility for the contributors.

Now I agree with @cbrnr that the practice how this is done is disadvantaging contributors who are less inclined to just put themselves on the files. I wonder if what we actually need is guidelines what makes a contribution substantial enough to have your name on the file (I also see how this would always be an arbitrary cut-off).

I am against having all names up, as this would add a lot of text for potentially less substantial commits (e.g., having fixed a typo in the docstring etc.) and would decrease visibility.

Another possibility could be to outsource the contributor list to another file we keep on our homepage - that could fix it for my first point, but again not for the visibility.

britta-wstnr avatar Mar 29 '23 09:03 britta-wstnr

But some (new) contributors might care, but not dare to ask to include their name

Ok right, very valid point. I've been in this situation before and I never knew when it's justified to add my name.

hoechenberger avatar Mar 29 '23 09:03 hoechenberger

git --no-pager log --follow --pretty=format:"%aN <%ae>" mne/report/report.py | nl | sort -uk2 | sort -nk1 | cut -f2-

If you can add this to a commit hook that automatically runs only on the changed files and then auto updates the author list (and deduplicates with our mailmap!) then I think that would be quite a good solution. Names who only made small typo fixes will still be in there but I'd rather err in that direction than to not include names who made large contributions

drammock avatar Mar 29 '23 12:03 drammock

Mh @drammock I'd argue that then the name list become meaningless, no? For all points mentioned here: ownership, visibility, pride, ...

britta-wstnr avatar Mar 29 '23 12:03 britta-wstnr

I'd argue that then the name list become meaningless, no? For all points mentioned here: ownership, visibility, pride, ...

But everyone who contributed has the same right to get ownership, visibility, etc.?

Since I wanted to avoid defining what an "important" contribution is, the best solution IMO is to remove author names from source files and try to add other means to provide these perks. For everyone.

cbrnr avatar Mar 29 '23 13:03 cbrnr

Well, my point is that with all names on there, it would not work that way. Would someone feel ownership/responsibility if there are 15+ other names? Probably no. Would someone get visibility if there are 15+ other names? Probably also no.

But I do think that if someone made substantial changes to a part of the code base, they should be able to get recognition for that beyond having their name in a long list of names or not mentioned at all. At this point I also want to throw in that this might be especially important for marginalized groups in open source, who might not be just "believed" to have "really contributed".

britta-wstnr avatar Mar 29 '23 13:03 britta-wstnr

Having observed this a bit, I want to revise my opinion. I think we should:

  • use CODEOWNERS for "ownership" --> this allows people to get notifications when files are touched
  • use our changelog for "credit"/"attribution" --> after all it's more relevant to be able to point to a specific PR that you have implemented, rather than say "I worked on this file in general"
    • I think what would be really awesome would be a way to filter the changelog by contributor ... something like: list me all contributions by @example-person across all MNE versions
    • ideally, such a query would be associated with a specific URL that could be shared. Then I could send someone a URL and they could see all my contributions with links to the specific PRs.
  • git blame for specifics (the "black" commit can be filtered via a .git-blame-ignore-revs file, which does also work on GitHub, next to the CLI)
  • keep author names in tutorials and examples, AND write some loose guidelines (preferably examples) when a contribution is considered big enough to add a name there ... OR add a "author contributions" section to each tutorial/example ... something like the CREDIT taxonomy
  • remove names and emails from files, and only keep the license abbreviation there --> credit/ownership/blame is IMHO properly solved by the above mentioned items

sappelhoff avatar Mar 29 '23 13:03 sappelhoff

I agree with @sappelhoff 100%!

cbrnr avatar Mar 29 '23 14:03 cbrnr

@britta-wstnr

But I do think that if someone made substantial changes to a part of the code base, they should be able to get recognition for that beyond having their name in a long list of names or not mentioned at all.

While I do agree, I also see the difficulty here that @cbrnr mentioned: how do we quantify that? How big or important is "substantial"? Changed lines of code (how many?)? Performance improvements (how much faster?)? Continuous maintenance of a module (for how long?)? Do the barriers we set here depend on seniority of the contributor (beginners -> lower, senior devs -> higher barrier?)?

There must be a clear path that contributors can follow to get their name added at the top-level. Otherwise it'll be a recipe for conflict and disappointment.

hoechenberger avatar Mar 29 '23 14:03 hoechenberger

@hoechenberger as mentioned in one of my earlier comments, I agree with that as well. However, that also is a problem for tutorials and examples as @sappelhoff acknowledges and proposes to write guidelines for - i.e., we have to solve that anyway.

britta-wstnr avatar Mar 29 '23 14:03 britta-wstnr

@britta-wstnr

But I do think that if someone made substantial changes to a part of the code base, they should be able to get recognition for that beyond having their name in a long list of names or not mentioned at all. At this point I also want to throw in that this might be especially important for marginalized groups in open source, who might not be just "believed" to have "really contributed".

I am very sympathetic to and supportive of these points. But until we have a proposal for how "substantial" is defined and adjudicated, I feel I cannot support inclusion/exclusion of any names --- either show all names, or show none.

I think several of us agree that letting the contributor decide whether or not to include their name isn't a very good solution: it disadvantages anyone feeling shy/humble/insecure and advantages anyone feeling arrogant/confident/entitled. But I also really don't want the mechanism to be "whoever is reviewing the PR decides" --- clearly there are differences of opinion among members of our steering council about how this should be handled, and whether a contributor gets credit in this way should not be determined by luck of the draw in who is available for PR reviews.

@sappelhoff

ideally, such a query would be associated with a specific URL that could be shared. Then I could send someone a URL and they could see all my contributions with links to the specific PRs.

This is already available: https://github.com/mne-tools/mne-python/commits?author=sappelhoff

keep author names in tutorials and examples, AND write some loose guidelines

See above point; "loose guideline" does not, IMO, eliminate the bias toward advantaging some contribs and disadvantaging others, and it adds overhead for devs to adjudicate (or at least "keep an eye out" for whether names were (not) added (in)appropriately).

drammock avatar Mar 29 '23 14:03 drammock

Examples and tutorials are typically touched by far fewer individuals, so it should not be a problem to include all names. Even when someone "just" fixes typos (BTW I think that fixing typos is extremely valuable). So our guideline should be to include everyone who modified an example or tutorial.

cbrnr avatar Mar 29 '23 14:03 cbrnr