tesseract
RFC: Use SPDX license identifier instead of long license plate and clean comments in file headers
From @egorpugin:
// SPDX-License-Identifier: Apache-2.0
https://spdx.org/licenses/ Linux uses them already. https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpuid.c#L1
I created this issue from @egorpugin's project list item to allow broader discussion.
Using SPDX license identifiers would reduce the source code lines by removing redundant information from the comments in the file headers (1 line instead of 10 lines).
We could start with the public header files in include/tesseract, use it for new files and then extend the changes to the remaining files as well.
I think it can be automated if we remove the other info in the headers (date, author etc.). I already have several sed/awk scripts for such updates in my projects.
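As an illustration of what such automation might look like, here is a minimal Python sketch (not the sed/awk scripts mentioned above). It assumes the license notice starts at a line containing "Licensed under the Apache License" and ends at a line containing "limitations under the License."; the marker strings, the function name `replace_license_text` and the sample header are assumptions made for this example, not taken from the Tesseract sources.

```python
SPDX_LINE = "SPDX-License-Identifier: Apache-2.0"

# Assumed markers: the first and last line of the Apache-2.0 notice.
NOTICE_START = "Licensed under the Apache License"
NOTICE_END = "limitations under the License."


def replace_license_text(source: str) -> str:
    """Replace the Apache-2.0 notice by a single SPDX line, keep everything else."""
    lines = source.splitlines(keepends=True)
    start = next((i for i, line in enumerate(lines) if NOTICE_START in line), None)
    if start is None:
        return source  # no notice found: leave the file untouched
    end = next((i for i in range(start, len(lines)) if NOTICE_END in lines[i]), None)
    if end is None:
        return source  # incomplete notice: do not guess
    # Reuse the comment prefix (for example " ** " or "// ") of the first notice line.
    prefix = lines[start][: lines[start].index(NOTICE_START)]
    # Reuse the line ending of the last notice line (LF, CRLF or none).
    last = lines[end]
    eol = last[len(last.rstrip("\r\n")):]
    return "".join(lines[:start] + [prefix + SPDX_LINE + eol] + lines[end + 1:])


# Hypothetical header, shortened for illustration.
SAMPLE = """\
/**********************************************************************
 * Description: Example header.
 * (C) Copyright 2006, Example Inc.
 ** Licensed under the Apache License, Version 2.0 (the "License");
 ** you may not use this file except in compliance with the License.
 ** See the License for the specific language governing permissions and
 ** limitations under the License.
 **********************************************************************/
int answer();
"""

print(replace_license_text(SAMPLE), end="")
```

The sketch only touches the license notice and leaves description, author and copyright lines alone, so it could be run before any decision about the other header fields is made.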
Note that the planning "Tesseract next" now includes additional columns for bug fixes and new features for release 5. If needed we could also create a separate project planning "Tesseract future" for release 6.
I think it can be automated in case if we remove other info in headers
Should such additional information be removed? Do we have to preserve it? I don't know the answer and would therefore start with removing the license text only.
Let's have a closer look at that other information and my personal opinion:
- File: can be removed without loss of information
- Description: should be kept as it is useful documentation
- Author: ???
- Copyright and year ???
- License text can be replaced by SPDX license identifier
- History: can be removed
- Created: can be removed
The author should probably be present in the git history, but that can't be true in all cases. Usually a copyright notice is used. For example, I have one here: https://github.com/SoftwareNetwork/sw/blob/b0.4.3/src/sw/manager/database.cpp#L1
Author: ???
Didn't you remove such lines in the past?
IMO, Author and copyright notice should be kept for legal reasons.
Yes, I remove History and Created in files where I make other changes.
My point was that you removed developers' names, AFAIR. I hope all of them are at least mentioned in AUTHORS.
I think most, if not all, of the removed names were HP employees. Maybe also some Googlers.
You are right, I did not remember that I removed authors in a few cases. I now found the 3 commits b5498c70fa0b171cb952e04c5d9176a09c70963b (new file content), d960a50c12c4b991d0a86ff4c1fb9b05fd580aae and 6d170a15ec7ca0950fc69734ed586ffe6465f9ca (uncertain author). So all of them were very special cases. The removed name is mentioned in doc/tesseract.1.asc.
So do we agree to
- replace the license text by SPDX license identifier
- remove File:, History: and Created:
- keep the rest of the header comments
?
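The proposal above could be sketched as a small filter that drops only the `File:`, `History:` and `Created:` lines from the leading comment and keeps everything else. This is a rough Python sketch under stated assumptions: the header is the first comment in the file, each field occupies a single line, and the function name `strip_header_fields` and the sample header are hypothetical.

```python
import re

# Fields the proposal would drop; every other header line is kept.
FIELD = re.compile(r"^\s*(?:/\*+|\*+|//+)?\s*(?:File|History|Created):")


def _ends_header(line: str) -> bool:
    """Assumption: the header ends at the first '*/' or the first code line."""
    stripped = line.strip()
    if "*/" in stripped:
        return True
    return bool(stripped) and not stripped.startswith(("/", "*"))


def strip_header_fields(source: str) -> str:
    """Remove File:, History: and Created: lines from the leading comment only."""
    out = []
    in_header = True
    for line in source.splitlines(keepends=True):
        if in_header and FIELD.match(line):
            continue  # drop this single field line
        out.append(line)
        if in_header and _ends_header(line):
            in_header = False  # never touch anything after the header
    return "".join(out)


# Hypothetical header for illustration.
SAMPLE_HEADER = """\
/**********************************************************************
 * File:        example.h
 * Description: Example header.
 * Author:      Jane Doe
 * Created:     Fri Oct 06 15:17:00 2006
 **********************************************************************/
// Created: this later comment is not part of the header
int answer();
"""

print(strip_header_fields(SAMPLE_HEADER), end="")
```

A `History:` entry that continues over several lines would need extra handling, which this sketch deliberately leaves out. Author and copyright lines are never matched, in line with the request to keep them.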
IANAL, but IMO, it's not a good idea to remove people's names from files. We don't have to credit contributors, but once a name appears in a file, it should not be removed without explicit permission from that person. Also, if a contributor wants to have a credit in AUTHORS we must give him that credit, unless his contribution is tiny, like a one-line typo fix.
https://en.wikipedia.org/wiki/Moral_rights
@stweil Agreed.
We can revisit authors later. On heavily modified code, new authors should be added following the current approach, but I don't like a big list of authors in code.
I think these changes should go into the next v6 (main) branch. Let v5 keep the old file headers.
https://lwn.net/Articles/739183/
https://www.linuxfoundation.org/blog/solving-license-compliance-at-the-source-adding-spdx-license-ids/
https://reuse.software/
https://spdx.dev/
https://www.kernel.org/doc/html/v4.18/process/license-rules.html
https://github.com/torvalds/linux/blob/master/LICENSES/preferred/BSD-3-Clause (One example for a license).
Let v5 keep old file headers.
I had never noticed SPDX before your comment in the planning list, but now I see that it is already used rather often. Obviously many users (mostly companies?) need some reliable and parsable license information. Therefore I think it would be good to add at least the SPDX-License-Identifier to the public header files, like it is done in pull request #3689.
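To illustrate what "parsable" means in practice, here is a minimal Python sketch that extracts the identifier from a source text and lists header files that lack one. The function names `spdx_identifier` and `headers_without_spdx` and the `*.h` glob are assumptions for this example; real compliance tools do considerably more.

```python
import re
from pathlib import Path
from typing import List, Optional

# Matches the tag in "//" comments and in "/* ... */" comments.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+?)\s*(?:\*/)?\s*$")


def spdx_identifier(source: str) -> Optional[str]:
    """Return the first SPDX license expression found in the text, or None."""
    for line in source.splitlines():
        match = SPDX_RE.search(line)
        if match:
            return match.group(1)
    return None


def headers_without_spdx(root: str) -> List[Path]:
    """List all *.h files below root that carry no SPDX identifier."""
    missing = []
    for path in sorted(Path(root).rglob("*.h")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if spdx_identifier(text) is None:
            missing.append(path)
    return missing


print(spdx_identifier("// SPDX-License-Identifier: Apache-2.0\n#pragma once\n"))
```

Calling `headers_without_spdx("include/tesseract")` from a checkout would then show which public headers still need the line.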
Sure.