reuse-tool
Usecase: automatically update copyright year headers on modified files
We have a repository where most files have a single copyright holder (us), but some specific files like the code of conduct have been copied and are attributed to others. Now when a new year rolls around the copyright annotations have to be updated.
I don't care as much if we update the copyright annotation on all files or just do it incrementally, as long as it is little effort. What method is recommended and should REUSE be improved to deal with this?
I can think of different approaches for this:
1. Strip date annotations from the files and skip the problem altogether.
2. Skip REUSE and just apply a regex with `sed` to replace `* SPDX-FileCopyrightText * 2022 * my organization` with `* SPDX-FileCopyrightText * 2023 * my organization`.
3. Run `reuse addheader --recursive --merge-copyrights` with the new year, but that will add headers to files that don't contain our copyright. So perhaps REUSE should support file filters or globbing to exclude certain files from being modified.
4. Run a `grep` command with `SPDX-FileCopyrightText * 2022 * my organization` to filter the files with the copyright and pipe them to the `reuse addheader --merge-copyrights` command.
5. Implement a git hook so that `reuse addheader --merge-copyrights` is run for all changed files on each commit. This seems the neatest solution.
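For illustration, options 2 and 4 can be sketched in plain Python instead of `sed`/`grep`. This is only a rough sketch under assumptions: it does not use REUSE at all, the function names `bump_copyright_year` and `update_tree` are made up for this example, and it only handles lines of the form `SPDX-FileCopyrightText: <year> <holder>`.

```python
import re
from pathlib import Path


def bump_copyright_year(text: str, holder: str, old_year: int, new_year: int) -> str:
    """Replace old_year with new_year, but only on SPDX lines for `holder`."""
    pattern = re.compile(
        rf"^(?P<prefix>.*SPDX-FileCopyrightText:[ \t]*){old_year}"
        rf"(?P<suffix>[ \t]+{re.escape(holder)}.*)$",
        re.MULTILINE,
    )
    return pattern.sub(rf"\g<prefix>{new_year}\g<suffix>", text)


def update_tree(root: Path, holder: str, old_year: int, new_year: int) -> list:
    """Rewrite every text file under `root` that carries the holder's old year.

    Files attributed to other holders are left untouched because the
    pattern never matches them. Returns the list of modified paths.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            original = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # most likely a binary file; skip it
        updated = bump_copyright_year(original, holder, old_year, new_year)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Something like `update_tree(Path("."), "my organization", 2022, 2023)` would then touch only the files that already carry our copyright line. Note that this replaces the year rather than producing a range, which is different from what `--merge-copyrights` does.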
How do other people deal with this issue? What can be recommended? Has somebody already written a git hook to automatically apply copyright to modified files and willing to share it?
Thanks for writing this issue, Nico. A few notes:
1. I'm not sure I'm following this one. By "date annotation" do you mean the entire `SPDX-FileCopyrightText` header? Stripping these would presumably mean having to specify all of this metadata in the `.reuse/dep5` file? But that wouldn't fix the problem. I think I don't understand.
2. I'm not a fan of this one. I like the idea of having a solid tool like REUSE with well thought out in/out states to deal with these things. Oh and by the way, `--merge-copyrights` would update my `2022` to be `2022 - 2023` in the next year. But of course that could still be achieved with this method, although this gets dirty quickly.
3. This sounds reasonable. I might actually like this one best, despite there being some extra maintenance. It's simple, clear and stateless: you explicitly tell which files need processing and which ones don't, and no matter what the state of your project is (new files, deleted, changed, etc.) this will work fine, is transparent and simple. The challenge here, however, is where you maintain the files to include/exclude (patterns/paths/whatever). Sure, you can repeatedly supply this in the CLI args, but it feels like some file in `.reuse/` might need to be added for this.
4. Not bad, but I would again prefer REUSE to do the work. If it supports `--recursive`, it should perhaps support more flexible selection mechanisms for leaving files in and out.
5. Clever, although my gut feeling is that there's some complexity here.
   i. How do you deal with new files? The script has no way of telling whether you want to process them or not. And even if you have a nice answer for that situation, I still dislike the stateful complexity of having to deal with new files differently from currently existing files. If I'm going to specify which ones are meant to be processed, I'd rather do that in the simple way of option 3.
   ii. What if I don't touch certain files for years? With option 3 everything gets updated for free.
So I'd say 3 or 5 are interesting with their own challenges, but I prefer 3 right now.
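To make the include/exclude idea from option 3 a bit more concrete, here is a minimal sketch of glob-based file selection. It is purely hypothetical: REUSE has no such filter as far as this thread is concerned, and the function name `select_files` and the pattern lists are invented for the example. It uses `fnmatch.fnmatchcase` from the standard library, where `*` also matches path separators.

```python
from fnmatch import fnmatchcase


def select_files(paths, include, exclude):
    """Return the paths matching any include pattern and no exclude pattern.

    Patterns are shell-style globs; note that `*` also matches `/` here,
    so `vendor/*` covers everything below `vendor/`.
    """
    selected = []
    for path in paths:
        included = any(fnmatchcase(path, pattern) for pattern in include)
        excluded = any(fnmatchcase(path, pattern) for pattern in exclude)
        if included and not excluded:
            selected.append(path)
    return selected


# Hypothetical repository contents and patterns.
paths = ["src/main.py", "CODE_OF_CONDUCT.md", "vendor/lib/util.py", "README.md"]
ours = select_files(paths, include=["*"], exclude=["CODE_OF_CONDUCT.md", "vendor/*"])
```

The selection is stateless in the sense described above: new, changed and untouched files are all treated the same way, and only the pattern lists need maintaining, wherever they end up living.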
Related issue it seems: #534
I agree 3 and 5 make most sense, not having to rely on outside hacks.
To clarify, with the first option I was thinking to replace:
# SPDX-FileCopyrightText: 2019 Example Company
With:
# SPDX-FileCopyrightText: Example Company
So excluding the date reference. You lose information in favor of simplicity.
I think we should do a refactoring on copyright statements more generally, to solve a whole host of adjacent problems. This is a loose proposal:
Instead of treating copyright statements as strings exclusively, let's make a (data)class `CopyrightStatement` for them, containing the attributes `copyright_prefix: str`, `year_range: Optional[YearRange]`, `copyright_holder: str`, `copyright_holder_contact: Optional[str]`, and `original: Optional[str]`. Make sure the `__str__` function is properly implemented to return `{copyright_prefix} {year_range} {copyright_holder} {copyright_holder_contact}` (or without `year_range` and/or `copyright_holder_contact` if undefined).
We'll also need to create a class YearRange. It contains a list of years, and __str__ renders, say, 2011, 2016-2018 from [2011, 2016, 2017, 2018]. Or 2011-2018—we can discuss this, and we can probably make this configurable.
Both of these classes need to be able to cleanly parse a whole host of ways of writing copyright statements.
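A rough sketch of what these two classes could look like follows. This is not code from the `reuse` package; the regex, the `parse` methods and the set of recognised prefixes are assumptions made for illustration, and a real implementation would need to accept many more spellings. The fields without defaults come first because dataclasses require that ordering.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class YearRange:
    years: List[int]

    def __str__(self) -> str:
        # Collapse runs of consecutive years into "start-end" spans.
        spans = []
        for year in sorted(set(self.years)):
            if spans and year == spans[-1][1] + 1:
                spans[-1][1] = year
            else:
                spans.append([year, year])
        return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)

    @classmethod
    def parse(cls, text: str) -> "YearRange":
        years = []
        for part in re.split(r"\s*,\s*", text.strip()):
            match = re.fullmatch(r"(\d{4})(?:\s*-\s*(\d{4}))?", part)
            if match is None:
                raise ValueError(f"cannot parse year range: {part!r}")
            start = int(match.group(1))
            end = int(match.group(2) or start)
            years.extend(range(start, end + 1))
        return cls(years)


# Assumed, deliberately small grammar: prefix, optional years, holder, optional <contact>.
_STATEMENT = re.compile(
    r"(?P<prefix>SPDX-FileCopyrightText:|©|Copyright(?:\s+\([cC]\))?)\s*"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*)?\s*"
    r"(?P<holder>[^<]+?)\s*"
    r"(?P<contact><[^>]*>)?\s*$"
)


@dataclass
class CopyrightStatement:
    copyright_prefix: str
    copyright_holder: str
    year_range: Optional[YearRange] = None
    copyright_holder_contact: Optional[str] = None
    original: Optional[str] = None

    def __str__(self) -> str:
        parts = [self.copyright_prefix]
        if self.year_range is not None:
            parts.append(str(self.year_range))
        parts.append(self.copyright_holder)
        if self.copyright_holder_contact:
            parts.append(self.copyright_holder_contact)
        return " ".join(parts)

    @classmethod
    def parse(cls, line: str) -> "CopyrightStatement":
        match = _STATEMENT.search(line)
        if match is None:
            raise ValueError(f"no copyright statement found in: {line!r}")
        years = match.group("years")
        return cls(
            copyright_prefix=match.group("prefix"),
            copyright_holder=match.group("holder"),
            year_range=YearRange.parse(years) if years else None,
            copyright_holder_contact=match.group("contact"),
            original=line,
        )
```

With this, `str(YearRange([2011, 2016, 2017, 2018]))` renders as `2011, 2016-2018`, and two statements with different prefixes can be compared on `copyright_holder` and `year_range` rather than on the raw string.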
Anyway, once we've done that refactoring, we'll be able to do so much more so much more easily. © 2011 Jane Doe and SPDX-FileCopyrightText: 2011 Jane Doe <[email protected]> are effectively identical, and we could just merge them. When adding © 2013 Jane Doe to a file containing © 2011 Jane Doe, we could merge them and (a.) merge the dates, (b.) pick the lowest date, or (c.) pick the highest date, configurable via some flag to addheader.
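The three merge behaviours (a.), (b.) and (c.) could look roughly like this once years are available as data. The function name `merge_years` and the strategy names are placeholders invented here, not existing `addheader` flags.

```python
def merge_years(existing, added, strategy="merge"):
    """Combine the years of two statements by the same copyright holder.

    strategy: "merge" keeps all years, "lowest" keeps only the earliest,
    "highest" keeps only the latest.
    """
    combined = sorted(set(existing) | set(added))
    if not combined:
        return []
    if strategy == "merge":
        return combined
    if strategy == "lowest":
        return [combined[0]]
    if strategy == "highest":
        return [combined[-1]]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Adding a 2013 statement to a file that already has a 2011 one.
merged = merge_years([2011], [2013], strategy="merge")
earliest = merge_years([2011], [2013], strategy="lowest")
latest = merge_years([2011], [2013], strategy="highest")
```

Whether the merged years are then rendered as `2011, 2013` or `2011-2013` is the separate rendering question mentioned above.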
We could also write the logic required for this issue quite easily. We'd probably need to create a new command for it, or you could just write a simple Python script that does import reuse and uses the above scaffolding to do what you want.
That sounds like a better data representation indeed! I can see how that can make things more robust and flexible. I'm not sure I follow how that would lead to an easy solution for this issue though, particularly because of the following:
In your project you may have files that need entirely different treatment:
- some aren't yours and you don't contribute to them. These files need to be skipped entirely.
- some aren't yours but you contribute to them. To these you may want to append your own copyright info.
I hope I'm being clear enough :).
I think it would help in two ways:
- Prevent duplicate copyright statements when an updated copyright annotation is added, since there is more certainty that they will be merged.
- Provide a strong data structure on which to build new features, like updating the copyright year of a certain copyright holder. Not sure if that feature is desirable, but at least reading copyright holders will be done with greater certainty.
In today's call we once again noticed that the data change @carmenbianca brought up in her comment (https://github.com/fsfe/reuse-tool/issues/561#issuecomment-1189925301) would be a great feature, but it would also involve a lot of work. Depending on the outcome of #536 this might even be more helpful.
We put it in the backlog in the hope that we find time for this one day.