RFC: Use SPDX license identifier instead of long license boilerplate and clean comments in file headers

Open stweil opened this issue 3 years ago • 16 comments

From @egorpugin:

// SPDX-License-Identifier: Apache-2.0

See https://spdx.org/licenses/. Linux already uses them, for example in https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpuid.c#L1

stweil avatar Dec 23 '21 20:12 stweil

I created this issue from @egorpugin's project list item to allow broader discussion.

Using SPDX license identifiers would reduce the number of source code lines by removing redundant information from the comments in the file headers (1 line instead of 10 lines).

We could start with the public header files in include/tesseract, use SPDX identifiers for new files, and then extend the change to the remaining files as well.

stweil avatar Dec 23 '21 20:12 stweil

I think it can be automated if we remove other info in headers (date, author etc.). I already have several sed/awk scripts for such updates in my projects.
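A minimal sketch of such automation, written in Python rather than sed/awk. It assumes the boilerplate is the standard Apache-2.0 notice, starting at a line containing "Licensed under the Apache License, Version 2.0" and ending at a line containing "limitations under the License."; the function name `replace_license_text` is hypothetical and not part of any existing script.

```python
SPDX_LINE = "SPDX-License-Identifier: Apache-2.0"
# Assumed first and last phrases of the standard Apache-2.0 notice.
START = "Licensed under the Apache License, Version 2.0"
END = "limitations under the License."


def replace_license_text(source: str) -> str:
    """Replace the Apache-2.0 notice with a single SPDX line.

    The text is returned unchanged if no complete notice is found.
    """
    lines = source.splitlines(keepends=True)
    start = next((i for i, line in enumerate(lines) if START in line), None)
    if start is None:
        return source
    end = next((i for i in range(start, len(lines)) if END in lines[i]), None)
    if end is None:
        return source
    # Reuse the comment prefix (for example " ** " or "// ") of the first notice line.
    prefix = lines[start][: lines[start].index(START)]
    newline = "\r\n" if lines[start].endswith("\r\n") else "\n"
    return "".join(lines[:start] + [prefix + SPDX_LINE + newline] + lines[end + 1:])
```

Everything outside the notice (description, author, copyright line) is left untouched, so this covers only the license text replacement and not the removal of other header fields.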

egorpugin avatar Dec 23 '21 20:12 egorpugin

Note that the "Tesseract next" planning now includes additional columns for bug fixes and new features for release 5. If needed, we could also create a separate "Tesseract future" project planning for release 6.

stweil avatar Dec 23 '21 20:12 stweil

> I think it can be automated if we remove other info in headers

Should such additional information be removed? Do we have to preserve it? I don't know the answer and would therefore start with removing the license text only.

Let's take a closer look at that other information, together with my personal opinion:

  • File: can be removed without loss of information
  • Description: should be kept as it is useful documentation
  • Author: ???
  • Copyright and year: ???
  • License text: can be replaced by SPDX license identifier
  • History: can be removed
  • Created: can be removed

stweil avatar Dec 23 '21 20:12 stweil

The author should probably be present in the git history, but that may not be true in all cases. A copyright line is usually kept. For example, I have one in https://github.com/SoftwareNetwork/sw/blob/b0.4.3/src/sw/manager/database.cpp#L1

egorpugin avatar Dec 23 '21 20:12 egorpugin

> Author: ???

Didn't you remove such lines in the past?

IMO, the author and copyright notices should be kept for legal reasons.

amitdo avatar Dec 24 '21 09:12 amitdo

Yes, I remove History and Created in files where I make other changes.

stweil avatar Dec 24 '21 09:12 stweil

My point was that you removed developers' names, AFAIR. I hope all of them are at least mentioned in AUTHORS.

I think most, if not all, of the removed names were HP employees. Maybe also some Googlers.

amitdo avatar Dec 24 '21 10:12 amitdo

You are right, I did not remember that I removed authors in a few cases. I now found the 3 commits b5498c70fa0b171cb952e04c5d9176a09c70963b (new file content), d960a50c12c4b991d0a86ff4c1fb9b05fd580aae and 6d170a15ec7ca0950fc69734ed586ffe6465f9ca (uncertain author). So all of them were very special cases. The removed name is mentioned in doc/tesseract.1.asc.

stweil avatar Dec 24 '21 10:12 stweil

So do we agree to

  • replace the license text by SPDX license identifier
  • remove File:, History: and Created:
  • keep the rest of the header comments

?
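The second point of this proposal could be sketched as follows in Python. This is only an illustration under stated assumptions: the header is assumed to be the leading comment of the file, each field is assumed to sit on a single line of the form `* File: ...` or `// File: ...`, and the function name `strip_header_fields` is hypothetical. Multi-line History entries would need extra handling.

```python
import re

# Fields proposed for removal; every other header line is kept.
DROP_FIELDS = ("File", "History", "Created")
FIELD_RE = re.compile(r"^\s*(?://|/?\*+)\s*(?:%s):" % "|".join(DROP_FIELDS))
# A line that still belongs to a leading comment ("//", "/*" or "*").
COMMENT_RE = re.compile(r"^\s*(?://|/\*|\*)")


def strip_header_fields(source: str) -> str:
    """Drop File:, History: and Created: lines from the leading comment."""
    out = []
    in_header = True
    for line in source.splitlines(keepends=True):
        if in_header:
            if FIELD_RE.match(line):
                continue  # drop this header field
            is_comment_or_blank = bool(COMMENT_RE.match(line)) or not line.strip()
            # The header ends when the block comment closes or code begins.
            if "*/" in line or not is_comment_or_blank:
                in_header = False
        out.append(line)
    return "".join(out)
```

Description, Author and copyright lines pass through unchanged, and comments after the header are not touched even if they happen to start with one of the field names.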

stweil avatar Dec 24 '21 10:12 stweil

IANAL, but IMO it's not a good idea to remove people's names from files. We don't have to credit contributors, but once a name appears in a file, it should not be removed without explicit permission from that person. Also, if a contributor wants to be credited in AUTHORS, we must give them that credit, unless the contribution is tiny, like a one-line typo fix.

https://en.wikipedia.org/wiki/Moral_rights

amitdo avatar Dec 24 '21 10:12 amitdo

@stweil Agreed.

We can revisit authors later. For heavily modified code, new authors should be added following the current approach, but I don't like big lists of authors in code.

egorpugin avatar Dec 24 '21 11:12 egorpugin

I think these changes should go into the next v6 (main) branch. Let v5 keep the old file headers.

egorpugin avatar Dec 24 '21 11:12 egorpugin

https://lwn.net/Articles/739183/

https://www.linuxfoundation.org/blog/solving-license-compliance-at-the-source-adding-spdx-license-ids/

https://reuse.software/

https://spdx.dev/

https://www.kernel.org/doc/html/v4.18/process/license-rules.html

https://github.com/torvalds/linux/blob/master/LICENSES/preferred/BSD-3-Clause (One example for a license).

amitdo avatar Dec 24 '21 11:12 amitdo

> Let v5 keep old file headers.

I had never noticed SPDX before your comment in the planning list, but have now seen that it is already used rather often. Obviously many users (mostly companies?) need reliable and parsable license information. Therefore I think it would be good to add at least the SPDX-License-Identifier to the public header files, as is done in pull request #3689.
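To illustrate what "parsable" means here, a small Python sketch that reads the identifier from a file header and lists header files that lack one. The function names and the 20-line search window are assumptions made for this example; real compliance tools have their own rules.

```python
from pathlib import Path
from typing import List, Optional

TAG = "SPDX-License-Identifier:"


def find_spdx_identifier(text: str, max_lines: int = 20) -> Optional[str]:
    """Return the SPDX expression from the first lines of text, or None."""
    for line in text.splitlines()[:max_lines]:
        _, found, rest = line.partition(TAG)
        if found:
            # Remove a closing "*/" in case the tag sits in a block comment.
            return rest.replace("*/", "").strip() or None
    return None


def headers_without_spdx(directory: str) -> List[Path]:
    """List *.h files below directory that carry no SPDX identifier."""
    missing = []
    for path in sorted(Path(directory).rglob("*.h")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if find_spdx_identifier(text) is None:
            missing.append(path)
    return missing
```

Calling `headers_without_spdx("include/tesseract")` from the repository root would then show which public headers still need the line.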

stweil avatar Dec 25 '21 15:12 stweil

Sure.

egorpugin avatar Dec 25 '21 16:12 egorpugin