space_packet_parser
space_packet_parser copied to clipboard
Add a schema validator / checker
It would be helpful to add a schema validation step or point to advice for how to do this for new users just setting up their initial XTCE documents. There is an official xsd document: https://www.omg.org/spec/XTCE/20180204/SpaceSystem.xsd This would also catch incorrect implementations on our end, like with the float naming identified in #128
A quick search shows there are some packages that do this, although I have no idea how robust/useful they would be: https://xmlschema.readthedocs.io/en/latest/usage.html
Some questions to think about:
- Do we want to add a dependency for schema validation, or just point to an external tool to do this. Could be a package extra too and not on by default.
- Do we want to commit the official xsd to the repository for checking? I have had that site go down on me several times now...
I agree that this would be a useful addition. I know that lxml has some XSD-based validation capabilities already. Since lxml is already a dependency, maybe we can look at whether it would suffice?
That looks great if we can leverage what we already have as a dependency!
Does it make the most sense to automatically try to validate against the schema upon instantiation of the XtcePacketDefinition class?
Does it make the most sense to automatically try to validate against the schema upon instantiation of the XtcePacketDefinition class?
Probably if it is a quick check, we already have the built-in dependency, and it doesn't raise on valid but complex schemas. It would be good to make sure wherever we add it that we add useful information saying why the schema match failed.
Here is a little example of doing schema validation with lxml. TIL that author is not a valid field in the XTCE Header and that validationStatus is a required field...
from lxml import etree
import requests
defdoc = "tests/test_data/jpss/jpss1_geolocation_xtce_v1.xml"
xsi_uri = "http://www.w3.org/2001/XMLSchema-instance"
tree = etree.parse(defdoc)
root = tree.getroot()
print(root.attrib)
# This attribute must always be a pair of strings separated by a space: e.g.
# xsi:schemaLocation="http://www.omg.org/spec/XTCE/20180204 https://www.omg.org/spec/XTCE/20180204/SpaceSystem.xsd"
xsd, xsd_url = root.attrib[f"{{{xsi_uri}}}schemaLocation"].split()
xsd_doc = requests.get(xsd_url)
#print(xsd_doc.content)
xsd_tree = etree.XML(xsd_doc.content)
xmlschema = etree.XMLSchema(xsd_tree)
result = xmlschema.validate(tree)
print(result)
The results of this validation are absolutely useless for debugging (True/False) so I used this tool to fix things: https://www.freeformatter.com/xml-validator-xsd.html
Indeed.
BTW, I started exploring this—I have a branch called feature/xml-schema-validation if you want to take a look.
I've found that using the assertion-based validation is a lot better/more useful than just the boolean .validate() method.
@cgobat That's great! Do you want to own this feature or would you rather someone else use that as a starting point? I would like to add automatic retrieval of XTCE XSD via the xsi:schemaLocation attribute (if provided) to this ticket.
I'm almost ready with a PR that handles namespaces much better, which might help with this.
Yeah I'm happy to own the initial implementation at least. That's good to know that you're in favor of automatic retrieval of the specified XSD. I was going to do that but held off since I wasn't sure how we felt about fetching content from online. Are you okay with adding requests as a dependency? There's a good chance it's already a dependency of an existing dependency, but I'm not positive.
Actually, nevermind. I think I can do it with just the standard library urllib if that's preferable.
Yeah, let's avoid another dependency if we don't need it. If we start doing a lot of HTTP requests for some other reason, then let's consider including requests.
And as for fetching XML online, it is technically a security vulnerability since parsing XML with lxml is vulnerable to things like XML bombs if you're using an old version of libxml.
We could limit the XSD retrieval to a validation method that is only called if the user explicitly requests it. This has the dual purpose of not inherently slowing down definition instantiation from XTCE, which has been a problem in the past.
@cgobat We're starting to gear up for the 6.0.0 release. How's it going with this feature? Do you think it might be finished soon so we can include it?