tools-python
mandatory DocumentNamespace is missing in RDF writer
While the tagvalue writer adds the DocumentNamespace tag to tagvalue files (see this line), the rdf writer does not handle it at all. It therefore produces non-compliant rdf documents, which do not pass validation by the java spdx tools and cannot be imported into other applications.
I realized that there are other problems with the RDF output; I guess this is related to #147.
Does this project need more resources?
For RDF, the DocumentNamespace is the namespace for the SpdxDocument. For example, if there is a DocumentNamespace http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301 there should be a property triple in the RDF like <spdx:SpdxDocument rdf:about="http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-DOCUMENT"> ... </spdx:SpdxDocument>
To obtain the namespace, the Java code looks for this reference.
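To make the lookup concrete, here is a minimal Python sketch of how a reader could recover the document namespace from the rdf:about attribute of the SpdxDocument element. It is not the Java implementation; it uses only the standard library, the helper name document_namespace is hypothetical, and it assumes the SPDX terms namespace is http://spdx.org/rdf/terms#.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SPDX_NS = "http://spdx.org/rdf/terms#"  # assumed SPDX 2.x terms namespace


def document_namespace(rdf_xml):
    """Return the part of SpdxDocument's rdf:about before '#', or None."""
    root = ET.fromstring(rdf_xml)
    for doc in root.iter(f"{{{SPDX_NS}}}SpdxDocument"):
        about = doc.get(f"{{{RDF_NS}}}about")
        if about and "#" in about:
            # Everything before the fragment is the document namespace.
            return about.split("#", 1)[0]
    return None


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:spdx="http://spdx.org/rdf/terms#">
  <spdx:SpdxDocument rdf:about="http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-DOCUMENT"/>
</rdf:RDF>"""

print(document_namespace(SAMPLE))
```

If the SpdxDocument element is missing, or its rdf:about has no fragment, the helper returns None, which is roughly the situation a validator cannot recover from.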
In a quick look at the code, it doesn't look like the spdx:SpdxDocument is added to the graph.
@alpianon Can you add an RDF file generated by the Python library? I'll take a look and confirm if this is an issue.
OK I prepared an example.
In zlib-rdf-test.zip you may find:
1) zlib-1.2.11.spdx: a tagvalue spdx file, generated by Scancode, which uses spdx-tools-python to generate it; I patched Scancode in order to have the mandatory DocumentNamespace tag included
2) zlib-1.2.11.spdx-python.rdf.xml: a rdf/xml spdx file, generated by Scancode (spdx-tools-python) from the same software package
3) zlib-1.2.11.spdx-java.rdf.xml: a rdf/xml spdx file, converted from zlib-1.2.11.spdx above through spdx-tools(-java)
The two rdf/xml files look very different from each other, and not only because the document namespace is missing in the one generated by spdx-tools-python. See for instance the use of the File class tag and of filename tags: the file generated by spdx-tools-python has multiple filename tags inside one single File class tag (which "covers" all files in the package), while the file generated by spdx-tools(-java) has only one filename tag inside each File tag, and there is a File tag for each file in the package. I do not know rdf/xml very well, so I'm not sure which is the right syntax or if both are correct.
BTW, if I try to validate the rdf/xml file generated via spdx-tools-python with spdx-tools(-java) I get the following errors:
Unable to parse the file: File zlib-1.2.11.spdx-python.rdf.xml is not a recognized RDF/XML or tag/value format.
While verifying for Tag/Value format: The SPDX Document Namespace must be set before other SPDX document properties are set..
While verifying for RDF/XML format: No external document reference was found for URI http://www.spdx.org/files#None
Summing up, spdx files 1) and 3) above appear to be valid spdx documents, while file 2) (the rdf/xml generated by spdx-tools-python) does not.
From a quick look, it does have the SpdxDocument element, so my hypothesis is not correct.
I'll take a deeper look at the example file and see if I can find out what the issue is.
The primary issue is that all of the files are using the wrong namespace URI.
The following line should be changed to use the document namespace followed by # followed by the ID:
https://github.com/spdx/tools-python/blob/d197a3adf95e2f4fc78c6983f5477f9b962bdaab/spdx/writers/rdf.py#L245
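As a rough illustration of the suggested change (not the actual patch to rdf.py), the node URI could be built like this. The helper name element_uri is hypothetical, and refusing a missing ID is an assumption made here to avoid emitting URIs ending in #None.

```python
def element_uri(document_namespace, spdx_id):
    """Build '<document namespace>#<SPDX ID>' for an SPDX element."""
    if not spdx_id:
        # Fail early rather than produce something like '...#None'.
        raise ValueError("element has no SPDX ID")
    # Tolerate a namespace that was stored with a trailing '#'.
    return f"{document_namespace.rstrip('#')}#{spdx_id}"


NAMESPACE = "http://spdx.org/spdxdocs/zlib-1.2.11-d5bfea82-7047-4cfa-b447-9012d0cf621d"
print(element_uri(NAMESPACE, "SPDXRef-132"))
```

The resulting string is what would be passed to the RDF library as the subject URI of the File node.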
I also noticed another issue - the extracted license infos are using anonymous nodes. They must use the following URI:
documentNamespace#LicenseRef-XXX
where XXX is a unique String.
The code that creates the license nodes is here:
https://github.com/spdx/tools-python/blob/d197a3adf95e2f4fc78c6983f5477f9b962bdaab/spdx/writers/rdf.py#L153
It looks like it should be creating a URI node rather than an anonymous node.
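A small sketch of how the URIs for extracted licenses could be derived and checked for uniqueness follows. The function name license_ref_uris is hypothetical, and the identifier pattern (letters, digits, "." and "-" after the LicenseRef- prefix) is my reading of the SPDX 2.x spec rather than something taken from the writer code.

```python
import re

# Assumed shape of a license reference identifier in SPDX 2.x.
_LICENSE_REF = re.compile(r"LicenseRef-[A-Za-z0-9.\-]+")


def license_ref_uris(document_namespace, identifiers):
    """Map each LicenseRef identifier to '<namespace>#<identifier>'."""
    seen = set()
    uris = []
    for identifier in identifiers:
        if not _LICENSE_REF.fullmatch(identifier):
            raise ValueError(f"not a valid license reference: {identifier!r}")
        if identifier in seen:
            raise ValueError(f"duplicate license reference: {identifier!r}")
        seen.add(identifier)
        uris.append(f"{document_namespace}#{identifier}")
    return uris


print(license_ref_uris("http://spdx.org/spdxdocs/example", ["LicenseRef-1", "LicenseRef-zlib.acknowledgement"]))
```

Each returned string would then be used to create a URI node in the graph in place of the anonymous node.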
From looking at the output RDF file, the FileID is always "None". This looks like it is an issue with the tag/value file. The tag/value file does not include the required SPDX IDs for the licenses.
You're right. And if you look at the zlib-1.2.11.spdx-java.rdf.xml file, you see that spdx-tools(-java) automatically adds it when converting from tagvalue to rdf, just by using a counter, like this:
<spdx:File rdf:about="http://spdx.org/spdxdocs/zlib-1.2.11-d5bfea82-7047-4cfa-b447-9012d0cf621d#SPDXRef-132">
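The counter approach could look roughly like the following in Python. This is only a sketch of the idea: the function name assign_missing_ids is hypothetical, and the exact numbering scheme used by spdx-tools(-java) is not taken from its source.

```python
from itertools import count


def assign_missing_ids(file_ids, prefix="SPDXRef-"):
    """Replace missing (None or empty) IDs with counter-based ones."""
    used = {fid for fid in file_ids if fid}
    counter = count(1)
    result = []
    for fid in file_ids:
        if not fid:
            # Skip counter values that collide with IDs already present.
            fid = f"{prefix}{next(counter)}"
            while fid in used:
                fid = f"{prefix}{next(counter)}"
            used.add(fid)
        result.append(fid)
    return result


print(assign_missing_ids([None, "SPDXRef-1", None]))
```

Existing IDs are kept as they are, so a tag/value file that already carries SPDX IDs would pass through unchanged.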
This should be fixed in the refactored version now found on main. Please speak up if the issue should be reopened.