mandatory DocumentNamespace is missing in RDF writer

Open alpianon opened this issue 5 years ago • 7 comments

While the tagvalue writer adds the DocumentNamespace tag to tagvalue files (see this line), the rdf writer does not consider it at all. It therefore produces non-compliant rdf documents, which do not pass validation by the java spdx tools and cannot be imported into other applications.

alpianon avatar Dec 08 '20 13:12 alpianon

I realized that there are other problems with RDF; I guess this is related to #147.

Does this project need more resources?

alpianon avatar Dec 08 '20 13:12 alpianon

For RDF, the DocumentNamespace is the namespace for the SpdxDocument. For example, if there is a DocumentNamespace http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301 there should be a property triple in the RDF like <spdx:SpdxDocument rdf:about="http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-DOCUMENT"> ... </spdx:SpdxDocument>

To obtain the namespace, the Java code looks for this reference.
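
As a rough illustration of that lookup (not the Java implementation, just a sketch using Python's standard library), the namespace can be recovered by finding the `spdx:SpdxDocument` element and cutting its `rdf:about` value at the `#`. The function name `document_namespace` and the `SAMPLE` snippet are made up for this example; the two XML namespace URIs are the standard RDF and SPDX 2.x terms namespaces.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SPDX_NS = "http://spdx.org/rdf/terms#"

# Minimal hand-written sample, reusing the example URI from the comment above.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:spdx="http://spdx.org/rdf/terms#">
  <spdx:SpdxDocument rdf:about="http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-DOCUMENT"/>
</rdf:RDF>"""


def document_namespace(rdf_xml):
    """Return the part of the SpdxDocument URI before '#', or None if absent."""
    root = ET.fromstring(rdf_xml)
    # Only looks at direct children of rdf:RDF, which is enough for this sketch.
    doc = root.find(f"{{{SPDX_NS}}}SpdxDocument")
    if doc is None:
        return None
    about = doc.get(f"{{{RDF_NS}}}about")
    if about is None or "#" not in about:
        return None
    # Everything before the last '#' is treated as the document namespace.
    return about.rsplit("#", 1)[0]


print(document_namespace(SAMPLE))
```

If the `spdx:SpdxDocument` element is missing, or its URI has no `#` fragment, the sketch returns `None`, which corresponds to the "namespace not found" situation described here.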

In a quick look at the code, it doesn't look like the spdx:SpdxDocument is added to the graph.

@alpianon Can you add an RDF file generated by the Python library? I'll take a look and confirm if this is an issue.

goneall avatar Dec 09 '20 19:12 goneall

OK I prepared an example.

In zlib-rdf-test.zip you may find:

  1. zlib-1.2.11.spdx : a tagvalue spdx file, generated by Scancode, which uses spdx-tools-python to generate it; I patched Scancode in order to have mandatory DocumentNamespace tag included
  2. zlib-1.2.11.spdx-python.rdf.xml : a rdf/xml spdx file, generated by Scancode (spdx-tools-python) from the same software package
  3. zlib-1.2.11.spdx-java.rdf.xml : a rdf/xml spdx file, converted from zlib-1.2.11.spdx above through spdx-tools(-java)

The two rdf/xml files look very different from each other, and not only because the document namespace is missing in the one generated by spdx-tools-python. See for instance the use of the File class tag and of filename tags: the file generated by spdx-tools-python has multiple filename tags inside one single File class tag (which "covers" all files in the package), while the file generated by spdx-tools(-java) has only one filename tag inside each File tag, and there is a File tag for each file in the package. I do not know rdf/xml very well, so I'm not sure which is the right syntax or if both are correct.

BTW, if I try to validate the rdf/xml file generated via spdx-tools-python with spdx-tools(-java) I get the following errors:

Unable to parse the file: File zlib-1.2.11.spdx-python.rdf.xml is not a recognized RDF/XML or tag/value format.
While verifying for Tag/Value format: The SPDX Document Namespace must be set before other SPDX document properties are set..
While verifying for RDF/XML format: No external document reference was found for URI http://www.spdx.org/files#None

Summing up, spdx files 1) and 3) above appear to be valid spdx documents, while file 2) (rdf/xml generated by spdx-tools-python) does not.

alpianon avatar Dec 09 '20 21:12 alpianon

From a quick look, it does have the SpdxDocument element, so my hypothesis is not correct.

I'll take a deeper look at the example file and see if I can find out what the issue is.

goneall avatar Dec 09 '20 23:12 goneall

The primary issue is that all of the files are using the wrong namespace URI.

The following line should be changed to use the document namespace followed by # followed by the ID:

https://github.com/spdx/tools-python/blob/d197a3adf95e2f4fc78c6983f5477f9b962bdaab/spdx/writers/rdf.py#L245

I also noticed another issue: the extracted license infos are using anonymous nodes. Each must use the following URI:

documentNamespace#LicenseRef-XXX

where XXX is a unique String.

The code that creates the license nodes is here:

https://github.com/spdx/tools-python/blob/d197a3adf95e2f4fc78c6983f5477f9b962bdaab/spdx/writers/rdf.py#L153

It looks like it should be creating a URI node rather than an anon. node.
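
A minimal sketch of the URI scheme described above, in plain Python. The helper names `element_uri` and `license_ref_uri` are hypothetical and are not part of tools-python. Assuming the writer builds its graph with rdflib, the resulting string would be wrapped in a `URIRef` instead of creating a `BNode`.

```python
def element_uri(document_namespace, element_id):
    """Build '<documentNamespace>#<ID>' for an SPDX element.

    Rejects a missing ID so that URIs ending in '#None' are never produced.
    """
    if not element_id or element_id == "None":
        raise ValueError("element has no SPDX ID; cannot build its URI")
    # Tolerate a namespace that already ends with '#'.
    return f"{document_namespace.rstrip('#')}#{element_id}"


def license_ref_uri(document_namespace, unique_suffix):
    """Build '<documentNamespace>#LicenseRef-XXX' for an extracted license."""
    return element_uri(document_namespace, f"LicenseRef-{unique_suffix}")


ns = "http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301"
print(element_uri(ns, "SPDXRef-DOCUMENT"))
print(license_ref_uri(ns, "1"))
```

Raising on a missing ID is a design choice for this sketch only; a writer could equally generate an ID at that point.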

goneall avatar Dec 10 '20 01:12 goneall

From looking at the output RDF file, the FileID is always "None". This looks like it is an issue with the tag/value file. The tag/value file does not include required SPDX ID's for the licenses.

goneall avatar Dec 10 '20 01:12 goneall

> From looking at the output RDF file, the FileID is always "None". This looks like it is an issue with the tag/value file. The tag/value file does not include required SPDX ID's for the licenses.

You're right. And if you look at the zlib-1.2.11.spdx-java.rdf.xml file, you see that spdx-tools(-java) automatically adds it when converting from tagvalue to rdf, just by using a counter, like this:

<spdx:File rdf:about="http://spdx.org/spdxdocs/zlib-1.2.11-d5bfea82-7047-4cfa-b447-9012d0cf621d#SPDXRef-132">
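
A counter-based fallback along those lines could look roughly like this in Python. This is only a sketch: the dict-based file records, the `spdx_id` key and the function name `assign_missing_spdx_ids` are assumptions for illustration, and how spdx-tools(-java) actually chooses its counter values is not shown in this thread.

```python
from itertools import count


def assign_missing_spdx_ids(files, start=1):
    """Return copies of the file records, giving 'SPDXRef-<n>' to those without an ID."""
    # IDs already present must not be handed out a second time.
    used = {f["spdx_id"] for f in files if f.get("spdx_id")}
    counter = count(start)
    result = []
    for f in files:
        spdx_id = f.get("spdx_id")
        if not spdx_id:
            # Advance the counter until an unused ID is found.
            spdx_id = f"SPDXRef-{next(counter)}"
            while spdx_id in used:
                spdx_id = f"SPDXRef-{next(counter)}"
            used.add(spdx_id)
        result.append({**f, "spdx_id": spdx_id})
    return result


files = [
    {"name": "adler32.c", "spdx_id": None},
    {"name": "crc32.c", "spdx_id": "SPDXRef-1"},
    {"name": "zlib.h"},
]
for record in assign_missing_spdx_ids(files):
    print(record["name"], record["spdx_id"])
```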

alpianon avatar Dec 10 '20 07:12 alpianon

This should be fixed in the refactored version now found on main. Please speak up if the issue should be reopened.

armintaenzertng avatar Mar 30 '23 08:03 armintaenzertng