tools-python icon indicating copy to clipboard operation
tools-python copied to clipboard

add support for multiple file checksums, file types

Open jotterson opened this issue 4 years ago • 9 comments

This change allows multiple checksums and file types on Files. This improves compliance with the SPDX 2.1 and newer specifications.

This was done to allow support for modern cryptographic hashes, e.g. sha256.

All 250 of the tests runs. The multiple file checksum and file types works with all storage types.

Signed-off-by: Jeffrey Otterson [email protected]

jotterson avatar Nov 26 '21 14:11 jotterson

I have an overlapping pull request at https://github.com/spdx/tools-python/pull/197 but I did not implement parsing.

Nice to see that you were able to implement parsing for all formats.

Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

dholth avatar Nov 29 '21 15:11 dholth

I have an overlapping pull request at #197 but I did not implement parsing.

Nice to see that you were able to implement parsing for all formats.

Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

jotterson avatar Nov 29 '21 22:11 jotterson

I have an overlapping pull request at #197 but I did not implement parsing. Nice to see that you were able to implement parsing for all formats. Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

I don't know about file type cardinality but there are a couple of more mundane details like the placement of the checksum field and name of algorithms in json, and the presence or absence of certain fields.

    from jsonschema import validate
    validate(spdx_data, json.load(open("spdx-schema.json")))

dholth avatar Nov 29 '21 22:11 dholth

I have an overlapping pull request at #197 but I did not implement parsing. Nice to see that you were able to implement parsing for all formats. Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

I don't know about file type cardinality but there are a couple of more mundane details like the placement of the checksum field and name of algorithms in json, and the presence or absence of certain fields.

    from jsonschema import validate
    validate(spdx_data, json.load(open("spdx-schema.json")))

Thanks for that. I still don't have the schema verifying, but I have fixed some of the schema validation problems (extra 'checksumAlgorithm_' text, etc.) I will merge what I have into my branch soon. I tried to put the schema validation right into the tests, but I think that unless I have your flatten and unflatten functions, that is not going to work right. I will note that I do have read/write for all file formats working. Jeff.

jotterson avatar Dec 01 '21 20:12 jotterson

@jotterson Thank you ++ .. let me review in details possibly later this week

pombredanne avatar Dec 15 '21 14:12 pombredanne

What is the status of this PR? I can assist if needed or I can make a separate PR from work I have done late last year that I believe solves the same issue.

0t1s1 avatar Jan 28 '22 02:01 0t1s1

I would still like to get this PR committed, somebody has to approve it. The changes are useful, and the tests all pass. Jeff.

jotterson avatar Feb 01 '22 00:02 jotterson

@pombredanne is this still under review?

0t1s1 avatar Feb 04 '22 18:02 0t1s1

I sure would like to merge this. What's it going to take? @pombredanne ???

jotterson avatar Jun 28 '22 12:06 jotterson

Hi @jotterson . Sorry this got sidelined for so long! We are trying to renew activity in this repo and this is certainly an important fix (or rather addition). There is a backlog of open PRs though, and https://github.com/spdx/tools-python/pull/197 seems to have overlap. We'll try to get that one merged first, then figure out which changes in this PR should still go in.

nicoweidner avatar Oct 25 '22 09:10 nicoweidner

I'm working on this. There are many requested changes, plus a potential rebase. It's gonna take a few days.

If you're in a hurry, I'm about half done, all the tests are currently passing, but the example data is still old. Let me know, and I will make a new commit with the fixes that I have.

jotterson avatar Nov 05 '22 01:11 jotterson

I'm working on this. There are many requested changes, plus a potential rebase. It's gonna take a few days.

If you're in a hurry, I'm about half done, all the tests are currently passing, but the example data is still old. Let me know, and I will make a new commit with the fixes that I have.

I'd like to get this merged fairly soon, but there is no immediate hurry. I'd say a timeline of about a week would be better than a month. I feel bad creating pressure here, since this PR was ignored for so long, sorry about that! Our plan is to get all outstanding fixes (and maybe a few additional ones) in and do an as-is release, since the last release was like 3 years ago. Then we plan to tackle some larger structural changes to the tools. In case you feel you'll be short on time, I can also try to help out with some of the fixes - in particular the rebase may take some work. I believe @meretp already had a look at that and noticed quite a few conflicts.

Also, sorry for the pile of comments, I tend to be very verbose in reviews :sweat_smile:. I do this with the mindset that it's ok to disagree with comments, I am very rarely dogmatic (some kind of reason should be given though).

nicoweidner avatar Nov 08 '22 11:11 nicoweidner

The tag/value parser is going to need a real enema. There are some horrible assumptions in there related to sequencing, package before files, etc. The latest tag/value files from the tools-java project really demonstrate that problem.

IMHO the current yacc solution is excessively complicated.

jotterson avatar Nov 13 '22 15:11 jotterson

The tag/value parser is going to need a real enema. There are some horrible assumptions in there related to sequencing, package before files, etc. The latest tag/value files from the tools-java project really demonstrate that problem.

IMHO the current yacc solution is excessively complicated.

The tag/value parser is a catastrophe. Configuring the parser via docstrings is honestly the most dubious thing I have seen in my coding life so far. I am not sure how a reasonably simple solution could look like, though. I think tag/value is inherently a bad format for SPDX, because SPDX documents are deeply nested and tag/value has no native support for nesting. The tools-java library uses a custom-built solution that uses less black magic, but it's not really shorter either (see parser and builder)

That said, @meretp, @armintaenzertng and myself dealt with it (read: had to deal with it...) recently while adding some new properties. If it's causing you headaches, feel free to skip tag/value for now and we'll try to finish it afterwards.

nicoweidner avatar Nov 14 '22 14:11 nicoweidner

Here is an update to my commit of 51 weeks ago (!).

The CircleCI tests are failing because the tag/value and RDF formats are not working. JSON, YAML, XML all work, spdx data can be converted between those three formats with no apparent loss of fidelity.

jotterson avatar Nov 20 '22 20:11 jotterson

Hi @jotterson. I am afraid I have to step in here, because with your recent massive changes, I see no way to get this PR rebased and merged. It is way too big for a single PR and addresses a variety of issues instead of focusing on a single one (several of those issues are already fixed on main). I do realize that the missing review for this PR (for about a year) played a large part in the current situation. I'm really sorry about that, but it's neither something I can change nor something that changes the situation.

In order to get this merged, the PR needs to be broken up into smaller chunks, such as:

  • support for multiple checksums and file types, as the name suggests (could even be broken up into two)
  • adding example files from tools-java
  • annotation parsing changes (didn't look at them in detail, but I don't think they belong in this PR)
  • changes to external references
  • changes to the xml parser (including a description of how this is preferrable to the previous simple solution that converted the xml to a plain dict)

With the changes concentrated in very few, massive commits, I don't see a way to easily cherry-pick specific aspects. The latest commit is also missing a signoff, though that is probably the smallest problem.

Issues that are already fixed on main:

  • renaming licenseInfoFromSnippet
  • files on document-level instead of nested inside packages
  • adding SPDX v2.3 properties such as packagePrimaryPurpose (2.3 support issues are tracked in https://github.com/spdx/tools-python/issues/227)
  • snippet byte/line range

Things that shouldn't be changed:

  • document_describes and has_files are Json-specific fields (+yaml/xml), they should not be added to the data model. They have to be parsed from Json and written to Json when de-/serializing

I am happy to help get smaller PRs reviewed quickly so the changes can be merged, but rebasing this PR in the current state will simply not be possible, I'm afraid.

nicoweidner avatar Nov 22 '22 17:11 nicoweidner