spdx-spec
spdx-spec copied to clipboard
Snippet range internal consistency
https://spdx.github.io/spdx-spec/5-snippet-information/#53-snippet-byte-range contains a tag example
SnippetByteRange: 310:420
and an RDF example
<Snippet rdf:about="...">
<range>
<ptr:StartEndPointer>
<ptr:startPointer>
<ptr:ByteOffsetPointer>
<ptr:reference rdf:resource="#SPDXRef-fileReference/>
<ptr:offset>310</ptr:offset>
</ptr:ByteOffsetPointer>
</ptr:startPointer>
<ptr:endPointer>
<ptr:ByteOffsetPointer>
<ptr:reference rdf:resource="#SPDXRef-fileReference/>
<ptr:offset>420</ptr:offset>
</ptr:ByteOffsetPointer>
</ptr:endPointer>
</ptr: StartEndPointer>
</range>
</Snippet>
The tag example is self-consistent - the range start and the end are both offsets, and they refer to the same file identified in section 5.2 SnippetFromFile:
<snippetFromFile rdf:about="#SPDXRef-filecontainingsnippet">
...
</Snippet>
The RDF example adds a huge amount of detail, all of which serves only as an attack surface or opportunity for error.
Consider this example with intentional inconsistencies:
<Snippet rdf:about="#SPDXRef-filecontainingsnippet">
<range>
<ptr:StartEndPointer>
<ptr:startPointer>
<ptr:ByteOffsetPointer>
<ptr:reference rdf:resource="#SPDXRef-file2Reference/>
<ptr:offset>310</ptr:offset>
</ptr:ByteOffsetPointer>
</ptr:startPointer>
<ptr:endPointer>
<ptr:LineCharPointer>
<ptr:reference rdf:resource="#SPDXRef-file3Reference"/>
<ptr:lineNumber>23</ptr:lineNumber>
</ptr:LineCharPointer>
</ptr:endPointer>
</ptr: StartEndPointer>
</range>
</Snippet>
The start and end refer to two different resources, each of which is different from the filecontainingsnippet, and the end pointer is a line number instead of a byte offset. Do any tools detect that start and end pointers must be the same type, and that resources must agree with the rdf:about?
The JSON data example follows the RDF structure rather than the tag structure, and the JSON schema gives up and doesn't attempt to validate the data.
"snippets" : [ {
"SPDXID" : "SPDXRef-Snippet",
"comment" : "...",
"copyrightText" : "...",
"licenseComments" : "...",
"licenseConcluded" : "GPL-2.0-only",
"licenseInfoInSnippets" : [ "GPL-2.0-only" ],
"name" : "from linux kernel",
"ranges" : [ {
"endPointer" : {
"lineNumber" : 23,
"reference" : "SPDXRef-DoapSource"
},
"startPointer" : {
"lineNumber" : 5,
"reference" : "SPDXRef-DoapSource"
}
}, {
"endPointer" : {
"offset" : 420,
"reference" : "SPDXRef-DoapSource"
},
"startPointer" : {
"offset" : 310,
"reference" : "SPDXRef-DoapSource"
}
} ],
"snippetFromFile" : "SPDXRef-DoapSource"
} ],
Suggest that the JSON "ranges" property be replaced with the far simpler, easier to validate, and less error prone structure consistent with the tag format:
{
"lineRanges": [[5, 23], [56, 68]],
"byteRanges": [[310, 420]],
"snippetFromFile" : "SPDXRef-DoapSource"
}
and that a shacl shape consistent with this structure be used in the v3 ontology.
There is a proposal to simplify this in version 3.0 of SPDX. So for, no one has objected to simplifying.
The general syntax is fine with me, but I would prefer to have a separate Snippet object for each range.
A snippet is defined to be a range of code, if there is code from a different origin in between ranges, I would consider that 2 separate snippets.
This would make it easier to correct if a snippet was incorrectly ID'd or if 2 separate consecutive lines were from different origins.
BTW - the reason for the rather complex syntax in RDF is we wanted to implement an existing W3C draft standard for pointers.
A separate Snippet per code range is good.
I'd think that if W3C were to create a pointer range, they would come up with something that specified each bit of data only once. The "bytes with optional line numbers" for a single code range is semantically different from two separate pointer ranges specifying bytes from code and lines from code.
Ask WWW3CD? 😊
{
...
"name" : "from linux kernel",
"snippetFromFile" : "SPDXRef-DoapSource",
"byteRange": [310, 420],
"lineRange": [5, 23]
}
Big +1 to this. It's made worse by the fact that byte range is a required property for snippets according to https://spdx.github.io/spdx-spec/v2.3/snippet-information/#93-snippet-byte-range-field. Any tool supporting snippets has to deal with all of the possible inconsistencies pointed out by @davaya if it offers validation, which is a major and unnecessary pain. I am confused that this issue is not more prominent - are snippets not used much?
In my opinion, the json spec made the situation even worse:
- It only requires either a byte range or a line range, which is inconsistent with the tag/value spec: https://github.com/spdx/spdx-spec/blob/v2.3/schemas/spdx-schema.json#L651-L700 (maybe an issue of itself)
- It allows for any number of ranges (instead of exactly one byte range and up to one line range)
-
startPointer
andendPointer
can have bothoffset
andlineNumber
-
startPointer
andendPointer
could also have neitheroffset
norlineNumber
I am confused that this issue is not more prominent - are snippets not used much?
Snippets are not very common, but they are used.
The main reason this proposal hasn't been accepted yet is that it would be a breaking change - hence targeted for 3.0. I'm pretty sure it will get into 3.0 since we all agree it needs to be addressed.
Resolved with 3.0.