tools-python
tools-python copied to clipboard
Inconsistent validation in writing and reading SPDX files; inconsistent examples
In the examples directory, there are the following two scripts:
- write_tv.py
- pp_tv.py
Running
python3 write_tv.py output
generates a spdx file without problem. Reading this file again:
python3 pp_tv.py output output2
Invalid DocumentNamespace value spdx, must contain a scheme (e.g. "https:") and should not contain the "#" delimiter, line:5
Errors encountered while parsing
The given line is:
DocumentNamespace: spdx
It seems that the writing of this value passed parsing in the writer, but reading the same value gives a mismatch.
I have spent some time in the code to see what is going on, and here is my analysis.
Writing
In write_tv.py, a Document object is created and subsequently written to a file using
write_document from from spdx.writers.tagvalue. This function has validate=True as default.
This implies that validate_namespace in document.py is called, but this only checks if there is a value, but no validation on the value.
The document is then written to disk without errors. If it would have given errors, an exception handler in write_tv would have triggered.
Reading
When reading the file, it is being validated against validators from validations.py, in particular function validate_doc_namespace.
This function does not only check if the value is present, but also checks if the value starts with a certain value.
Reading the outputfile of write_tv fails, because write_tv allows the invalid value 'spdx' being written in the file, which is not accepted by the stringent check in validations.py.
Suggestion
With my limited view on the project, I would say that it should be impossible to write a SPDX file with write_document that is not parsable by the projects own parser. Perhaps it would be a good idea to have the validation in document.py reuse the more stringent validators in validations.py so that there is only one set of validation function to maintain.
Thanks for your issue! I had a look at it and came to the same conclusion. In my opinion there are two ways to fix this issue:
-
Easy fix: Adapt the
write_tv.pyso that a valid document with a valid namespace is generated. -
A somewhat more 'consistent fix': As you already proposed-use the same validation for parsing and writing files. I am not sure why the validation was implemented in these two different ways in the first step. Maybe @pombredanne or @licquia can help here. What do you think about changing the code so that the validation of documents is also done with the methods from
validations.py?
The new refactored version now found on main features a completely separate validation layer that can be used for consistent validation. Additionally, all invalid examples have been removed.