spdx-spec icon indicating copy to clipboard operation
spdx-spec copied to clipboard

Snippet range internal consistency

Open davaya opened this issue 3 years ago • 5 comments

https://spdx.github.io/spdx-spec/5-snippet-information/#53-snippet-byte-range contains a tag example

SnippetByteRange: 310:420

and an RDF example

<Snippet rdf:about="...">
    <range>
        <ptr:StartEndPointer>
            <ptr:startPointer>
                <ptr:ByteOffsetPointer>
                    <ptr:reference rdf:resource="#SPDXRef-fileReference/>
                    <ptr:offset>310</ptr:offset>
                </ptr:ByteOffsetPointer>
            </ptr:startPointer>
            <ptr:endPointer>
                <ptr:ByteOffsetPointer>
                    <ptr:reference  rdf:resource="#SPDXRef-fileReference/>
                    <ptr:offset>420</ptr:offset>
                </ptr:ByteOffsetPointer>
            </ptr:endPointer>
        </ptr: StartEndPointer>
    </range>
</Snippet>

The tag example is self-consistent - the range start and the end are both offsets, and they refer to the same file identified in section 5.2 SnippetFromFile:

    <snippetFromFile rdf:about="#SPDXRef-filecontainingsnippet">
        ...
    </Snippet>

The RDF example adds a huge amount of detail, all of which serves only as an attack surface or opportunity for error.

Consider this example with intentional inconsistencies:

<Snippet rdf:about="#SPDXRef-filecontainingsnippet">
    <range>
        <ptr:StartEndPointer>
            <ptr:startPointer>
                <ptr:ByteOffsetPointer>
                    <ptr:reference rdf:resource="#SPDXRef-file2Reference/>
                    <ptr:offset>310</ptr:offset>
                </ptr:ByteOffsetPointer>
            </ptr:startPointer>
            <ptr:endPointer>
              <ptr:LineCharPointer>
                  <ptr:reference rdf:resource="#SPDXRef-file3Reference"/>
                  <ptr:lineNumber>23</ptr:lineNumber>
              </ptr:LineCharPointer>
            </ptr:endPointer>
        </ptr: StartEndPointer>
    </range>
</Snippet>

The start and end refer to two different resources, each of which is different from the filecontainingsnippet, and the end pointer is a line number instead of a byte offset. Do any tools detect that start and end pointers must be the same type, and that resources must agree with the rdf:about?

The JSON data example follows the RDF structure rather than the tag structure, and the JSON schema gives up and doesn't attempt to validate the data.

  "snippets" : [ {
    "SPDXID" : "SPDXRef-Snippet",
    "comment" : "...",
    "copyrightText" : "...",
    "licenseComments" : "...",
    "licenseConcluded" : "GPL-2.0-only",
    "licenseInfoInSnippets" : [ "GPL-2.0-only" ],
    "name" : "from linux kernel",
    "ranges" : [ {
      "endPointer" : {
        "lineNumber" : 23,
        "reference" : "SPDXRef-DoapSource"
      },
      "startPointer" : {
        "lineNumber" : 5,
        "reference" : "SPDXRef-DoapSource"
      }
    }, {
      "endPointer" : {
        "offset" : 420,
        "reference" : "SPDXRef-DoapSource"
      },
      "startPointer" : {
        "offset" : 310,
        "reference" : "SPDXRef-DoapSource"
      }
    } ],
    "snippetFromFile" : "SPDXRef-DoapSource"
  } ],

Suggest that the JSON "ranges" property be replaced with the far simpler, easier to validate, and less error prone structure consistent with the tag format:

{
    "lineRanges": [[5, 23], [56, 68]],
    "byteRanges": [[310, 420]],
    "snippetFromFile" : "SPDXRef-DoapSource"
}

and that a shacl shape consistent with this structure be used in the v3 ontology.

davaya avatar Jul 06 '21 21:07 davaya

There is a proposal to simplify this in version 3.0 of SPDX. So for, no one has objected to simplifying.

The general syntax is fine with me, but I would prefer to have a separate Snippet object for each range.

A snippet is defined to be a range of code, if there is code from a different origin in between ranges, I would consider that 2 separate snippets.

This would make it easier to correct if a snippet was incorrectly ID'd or if 2 separate consecutive lines were from different origins.

goneall avatar Jul 08 '21 04:07 goneall

BTW - the reason for the rather complex syntax in RDF is we wanted to implement an existing W3C draft standard for pointers.

goneall avatar Jul 08 '21 04:07 goneall

A separate Snippet per code range is good.

I'd think that if W3C were to create a pointer range, they would come up with something that specified each bit of data only once. The "bytes with optional line numbers" for a single code range is semantically different from two separate pointer ranges specifying bytes from code and lines from code.

Ask WWW3CD? 😊

{
    ...
    "name" : "from linux kernel",
    "snippetFromFile" : "SPDXRef-DoapSource",
    "byteRange": [310, 420],
    "lineRange": [5, 23]
}

davaya avatar Jul 08 '21 12:07 davaya

Big +1 to this. It's made worse by the fact that byte range is a required property for snippets according to https://spdx.github.io/spdx-spec/v2.3/snippet-information/#93-snippet-byte-range-field. Any tool supporting snippets has to deal with all of the possible inconsistencies pointed out by @davaya if it offers validation, which is a major and unnecessary pain. I am confused that this issue is not more prominent - are snippets not used much?

In my opinion, the json spec made the situation even worse:

  • It only requires either a byte range or a line range, which is inconsistent with the tag/value spec: https://github.com/spdx/spdx-spec/blob/v2.3/schemas/spdx-schema.json#L651-L700 (maybe an issue of itself)
  • It allows for any number of ranges (instead of exactly one byte range and up to one line range)
  • startPointer and endPointer can have both offset and lineNumber
  • startPointer and endPointer could also have neither offset nor lineNumber

nicoweidner avatar Nov 10 '22 13:11 nicoweidner

I am confused that this issue is not more prominent - are snippets not used much?

Snippets are not very common, but they are used.

The main reason this proposal hasn't been accepted yet is that it would be a breaking change - hence targeted for 3.0. I'm pretty sure it will get into 3.0 since we all agree it needs to be addressed.

goneall avatar Nov 11 '22 15:11 goneall

Resolved with 3.0.

goneall avatar Apr 04 '24 16:04 goneall