cyclonedx-python-lib icon indicating copy to clipboard operation
cyclonedx-python-lib copied to clipboard

Validation errors are hard to present safely to the user (missing abstraction)

Open e3krisztian opened this issue 6 months ago • 6 comments

https://cyclonedx-python-library.readthedocs.io/en/v10.2.0/autoapi/cyclonedx/validation/

We are using both JSON and XML inputs, and when something is wrong with the input, it is not easy to get the location of the problem or even what is wrong can be hidden in a multi-MB message.

One of the problem is, that the underlying libraries make it hard:

  • jsonschema includes all the input (instance) in the error message, which in the SBOM case can be quite big, producing the above mentioned multi-MB message (this uniqueItems check can fail on e.g. the dependencies):
            yield ValidationError(f"{instance!r} has non-unique elements")
    
  • in the xml case, somehow the easiest solution was to get the error from the logs: https://github.com/CycloneDX/cyclonedx-python-lib/blob/1a932a2ab00efb029c7b685cba7d9e5af3b7ea19/cyclonedx/validation/xml.py#L71

The other problem is, that CycloneDX makes no attempt at transforming these different object types into something sensible and type-safe for users, the raw objects are simply leaked through the interface as is in https://github.com/CycloneDX/cyclonedx-python-lib/blob/1a932a2ab00efb029c7b685cba7d9e5af3b7ea19/cyclonedx/validation/init.py#L36

Code samples triggering long messages:

from cyclonedx.validation.json import JsonStrictValidator
from cyclonedx.schema import SchemaVersion

test_data_file = "tests/_data/schemaTestData/1.2/invalid-license-id-1.2.json"
schema_version = SchemaVersion.V1_2
validator = JsonStrictValidator(schema_version)
with open(test_data_file) as tdfh:
    test_data = tdfh.read()
validation_error = validator.validate_str(test_data)
print(str(validation_error))

This message is 35508 characters long - 767 lines!

from cyclonedx.validation.xml import XmlValidator
from cyclonedx.schema import SchemaVersion

test_data_file = "tests/_data/schemaTestData/1.1/invalid-license-id-1.1.xml"
schema_version = SchemaVersion.V1_1

validator = XmlValidator(schema_version)
with open(test_data_file) as tdfh:
    test_data = tdfh.read()
validation_error = validator.validate_str(test_data)
print(str(validation_error))

This message is 12423 characters long - 1 line.


I would expect the errors returned/raised by CycloneDX something like below:

class ValidationError:
    # abstract class
    data: Any
    "raw problem, for debugging"

    path: str
    message: str

class XmlValidationError(ValidationError):
     # this subclass knows what data is
    @property
    def path(self):
        return self.data.path

    @property
    def message(self):
        return self.data.message

class JsonValidationError(ValidationError):
     # this subclass knows what data is
    @property
    def path(self):
        return self.data.json_path

    @property
    def message(self):
        # ensures the error is transformed to something sensible
        # resolving a problem caused by using jsonscheme for CycloneDX users
        instance = repr(self.data.instance)
        return self.data.message.replace(instance, shortened(instance))
        # where shortened(long_text) ~ 'first n ... last n', that is the middle of the string replaced
        # this would still add some context, but it will be safe to display

These would provide a stable abstraction over generally useful validation error properties, and also hide implementation details from users, like third party objects lxml.etree._LogEntry and jsonschema.exceptions.ValidationError. The above proposal is also backward compatible, keeping data intact, if someone depends on it.

e3krisztian avatar Jun 17 '25 15:06 e3krisztian

This python library is a community effort. Feel free to donate improvements via pullrequeasts. Here are our contribution guidelines: https://github.com/CycloneDX/cyclonedx-python-lib/blob/main/CONTRIBUTING.md

PS: please be aware that the underlying validation libraries are intended to be exchangeable - don't depend too much on their infra.

jkowalleck avatar Jun 18 '25 08:06 jkowalleck

let's discuss how a useful validation.ValidationError could look like. (dont care for feasibility right now; let's share our dreams/wishes and tell user stories)

I'll go first - just a very minimalistic outline of something that i would love to have

class ValidationError:
    """Validation failed with this specific error."""

    def __init__(self, message: str) -> None:
      self._message = message


    @property
    def message () -> str:
      """Humanreadable, brief error message"""
      # exameples for XML error: `invalid value "foo" for enum field /bom/components[13]/@type`
      return self._message

    def __repr__(self) -> str:
        ...  # the usual stuff

    def __str__(self) -> str:
        ...  # the usual stuff

class XmlValidationError(ValidationError):
  pass


class JsonValidationError(ValidationError):
  pass

this is just my kick-off - not a final proposal :-) Please post your ideas and examples in the comments below.

PS: I would really be interested what the user story for path. what are you going to do with it, if you had it?

jkowalleck avatar Jun 27 '25 16:06 jkowalleck

This is very similar to what was proposed above, except for the path attribute, so I have nothing else to add to this interface. The message implementation is however non-trivial, as it is not clear how to reduce the size for human consumption (I have edited the description to include code samples using test data triggering some longish messages).

path is not special - it is just information, which might or might not be present in the string representation message, but is very useful info for developers: some tool must have created the bad sbom, and path shows where the problem is.

In an ideal world we could happily use just message, but we might have more type specific useful info to expose (e.g. in the XML case there seems to be a line property as well).

e3krisztian avatar Jun 30 '25 09:06 e3krisztian

Is there any user story for the path and line except being informational for debugging? I mean, is there some automatism that would need the path/line in an atomic field, and if so, what's the user story behind it?

jkowalleck avatar Jun 30 '25 10:06 jkowalleck

For me, the user story is this: this is an error object, that is created, when the input document has a problem and is rejected. This object may hold information that can be used to help in fixing either the document or the process leading to the document. Path and line are information that can be presented as part of the message to the user who will do this fixing. They may use it for debugging either the document or some tool that produced the invalid SBOM document.

The library provided message may be good for 80% of cases (or more), but maybe not all, and if it is in the not so helpful case, it may be better to build a new end-user message from the error properties, than use the library provided one. I can not give you examples when the library provided shortened message will not be sufficient, as I have no data yet, however the code I am working on already extends the messages with the path info.

Please also note, that it is not necessarily the same group of people who are creating the sbom document (and performing the debugging on rejected ones) and who are using the cyclonedx library.

e3krisztian avatar Jun 30 '25 12:06 e3krisztian

I see. I will dig into the validator libraries in use, and see if i can collaborate with the respective maintainers in case the needed information is not publicly accessible.

jkowalleck avatar Jun 30 '25 15:06 jkowalleck