xmlschema
XML Schema Download incorrectly modifies the schema
I am trying to download an XML schema from a remote URL, and the library seems to be modifying one of the schema documents incorrectly.
Here's a snippet of the code to reproduce the issue:
import os.path
import urllib.parse

import xmlschema


def main():
    # Schema base URL resource
    xsd_base_uri = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd"
    # Extract the path from the URI
    path = urllib.parse.urlparse(xsd_base_uri).path
    print(path)
    # Split the path into directory + resource name
    schema_path = os.path.split(path)
    print(schema_path)
    target_path = f"schemas/{schema_path[0]}"
    # Create the directory if it doesn't exist
    os.makedirs(target_path, exist_ok=True)
    local_target_path = f"{target_path}/{schema_path[1]}"
    # Download and export the schema only if no local copy exists yet
    if not os.path.isfile(local_target_path):
        schema = xmlschema.XMLSchema(xsd_base_uri)
        schema.export(target=target_path, save_remote=True)
    # Load the schema from the local copy
    schema = xmlschema.XMLSchema(local_target_path)


if __name__ == '__main__':
    main()
The library seems to be modifying the xs:import line in the autoConfigure.xsd document:
The left side is the original file downloaded from the URL: http://www.accellera.org/XMLSchema/IPXACT/1685-2022/autoConfigure.xsd
Because of this edit, loading the exported schema fails with the following error, since the modified file is no longer a valid XML Schema:
xmlschema.validators.exceptions.XMLSchemaParseError: the QName 'xml:id' is mapped to the namespace 'http://www.w3.org/XML/1998/namespace', but this namespace has not an xs:import statement in the schema:
Schema component:
<xs:attribute xmlns:xs="http://www.w3.org/2001/XMLSchema" ref="xml:id" />
Path: /xs:schema/xs:attributeGroup[2]/xs:attribute
There are other changes as well, but they have no significant impact. Is there an option to download the XML schemas without editing them?
Hi, thank you for the detailed explanation.
An alternative is the uri_mapper option, available since release v3.0.0 (download the schemas manually and then provide a map for remote URLs to local paths).
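As a rough illustration of that suggestion, here is a minimal sketch of a mapper from remote URLs to local paths. The local directory name and the function name are hypothetical, and the sketch assumes the schemas have already been downloaded by hand into that directory and that uri_mapper accepts a callable taking the URI string and returning the mapped location.

```python
# Remote base URL taken from the issue; LOCAL_DIR is a hypothetical folder
# into which the XSD files were downloaded manually.
REMOTE_BASE = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/"
LOCAL_DIR = "schemas/XMLSchema/IPXACT/1685-2022"


def local_uri_mapper(uri: str) -> str:
    """Map a remote schema URL to its local copy; return other URIs unchanged."""
    if uri.startswith(REMOTE_BASE):
        return f"{LOCAL_DIR}/{uri[len(REMOTE_BASE):]}"
    return uri


print(local_uri_mapper(REMOTE_BASE + "index.xsd"))
print(local_uri_mapper("http://www.w3.org/2001/xml.xsd"))
```

Under those assumptions, the function would be passed when building the schema, e.g. xmlschema.XMLSchema(xsd_base_uri, uri_mapper=local_uri_mapper), so that the original files are read untouched from disk.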
For now, I have manually reverted the changes in the XML Schema that affect me so I can move ahead. Would it be possible to report (in logs, etc.) which files are being modified from their original sources? I spent nearly half a day before realizing the underlying issue.
I could add a logger for the export method (the export_schema function, in fact), with an optional loglevel argument like the one now available for schema initialization/building.
Maybe a fix in this helper is sufficient to solve this:
import re


def replace_location(text: str, location: str, repl_location: str) -> str:
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*[\'\"].*%s.*[\'"]' % re.escape(location)
    return re.sub(pattern, repl, text)
The replacement pattern also matches the namespace part, so in your case the XML namespace is left without an xs:import element that declares it.
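To make the effect concrete, here is a small sketch that runs the helper quoted above on a single xs:import line. The sample line is an assumption (an import with schemaLocation written before namespace), not the exact text of autoConfigure.xsd, and replace_location_strict is a hypothetical alternative shown only for contrast, not the fix shipped by the library.

```python
import re


def replace_location(text: str, location: str, repl_location: str) -> str:
    # Helper as quoted in the thread: the greedy ".*" can run past the closing quote.
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*[\'\"].*%s.*[\'"]' % re.escape(location)
    return re.sub(pattern, repl, text)


def replace_location_strict(text: str, location: str, repl_location: str) -> str:
    # Hypothetical variant: match only an attribute whose value is exactly `location`.
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*([\'"])%s\1' % re.escape(location)
    return re.sub(pattern, lambda match: repl, text)


# Assumed sample line, with schemaLocation preceding namespace.
line = ('<xs:import schemaLocation="http://www.w3.org/2001/xml.xsd" '
        'namespace="http://www.w3.org/XML/1998/namespace"/>')
location = "http://www.w3.org/2001/xml.xsd"

greedy = replace_location(line, location, "xml.xsd")
strict = replace_location_strict(line, location, "xml.xsd")
print(greedy)  # the namespace attribute is swallowed by the match
print(strict)  # the namespace attribute is preserved
```

The greedy version extends its match to the last quote on the line, which is why the namespace attribute disappears from the rewritten import.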
Another improvement (reducing useless changes) could be to skip the erasing of residual non-remote locations.
The new release v3.1.0 has a fix for schema exports. The replacement pattern has been replaced with a safer one (assuming that the source is a valid XML document ...) and the residual imports are cleared only if schemaLocation contains a remote URL.
A logging facility has also been added to the export_schema() function (it can be activated by providing logging='DEBUG' to XMLSchema.export()).
I tried out the latest release, and it no longer seems to modify the XML schema.
But it does not seem to download the xml.xsd dependency referenced in the import statement.
Should I create another bug report for that?
The XML namespace is already loaded within the meta-schema, so an xs:import element has to be present in the schema if the namespace is used (e.g. xml:base), but the import has no effect (and in many cases I found that the location points to an HTML page instead of a regular XSD file).
So downloading the remote xml.xsd (if any) is not necessary for xmlschema; the only problem is the removal of the namespace attribute from the xs:import statement.
To clarify: the schema export doesn't download anything. It only uses the already downloaded XSD sources contained in the schema instance and saves them locally.
Regarding "(and in many cases I found that the location points to an HTML page instead of a regular XSD file)": I'm sorry, I misremembered. The referenced xml.xsd (e.g. "http://www.w3.org/2001/xml.xsd") is an XSD file with a stylesheet.
Schema classes use a meta-schema that already has loaded a minimal set of base namespaces:
- XML namespace: "http://www.w3.org/2001/xml.xsd"
- XSI namespace: 'http://www.w3.org/2001/XMLSchema-instance'
- XSD namespace: 'http://www.w3.org/2001/XMLSchema'
- VC namespace: 'http://www.w3.org/2007/XMLSchema-versioning' (XMLSchema11 only)
I cannot remove XML from the base namespaces because xml:lang is used in the XSD namespace meta-schema (with a regular import). The meta-schema plays a fundamental part in efficient validation and decoding, although it can be rebuilt if needed.
Anyway, I think the export procedure can be extended with another option that attempts to load and save the residual locations referred to by skipped xs:import elements. I will try this for a future release.
FYI about the special status of the above four base namespaces: https://www.w3.org/TR/xmlschema11-1/#sec-nss-special
@AmeyaVS: I will not change schema export to download skipped schemas such as xml.xsd, but in the next minor release I will add a new API, download_schemas(), for downloading a set of schemas given a URL as the starting point.
Makes sense. Thank you for getting a fix in quickly.
Should I close this issue in the meantime, or should I keep it open until the download_schemas() API is ready? I could update here with details or any other observations.
Keep it open, the next minor release should be ready soon.
The download_schemas() API is available with release v3.2.0.
Hello @brunato,
I tried the following code to exercise the download_schemas API.
import os.path
import urllib.parse

import xmlschema
from xmlschema import download_schemas


def main():
    # Schema base URL resource
    xsd_base_uri = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd"
    # Extract the path from the URI
    path = urllib.parse.urlparse(xsd_base_uri).path
    print(path)
    # Split the path into directory + resource name
    schema_path = os.path.split(path)
    print(schema_path)
    target_path = f"schemas/{schema_path[0]}"
    # Create the directory if it doesn't exist
    os.makedirs(target_path, exist_ok=True)
    local_target_path = f"{target_path}/{schema_path[1]}"
    # Download the schemas with the new API into a separate directory
    download_schemas(xsd_base_uri, target="schemas2")
    # Export via the schema instance only if no local copy exists yet
    if not os.path.isfile(local_target_path):
        schema = xmlschema.XMLSchema(xsd_base_uri)
        schema.export(target=target_path, save_remote=True)
    # Load the schema from the local copy
    schema = xmlschema.XMLSchema(local_target_path)


if __name__ == '__main__':
    main()
I am observing the following error messages on the console for these XSD URLs:
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/addressBlockDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/registerFileDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/registerDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/memoryMapDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/enumerationDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/fieldDefinition.xsd: not well-formed (invalid token): line 15, column 46
Looking at the index.xsd source file:
<xs:schema xmlns:ipxact="http://www.accellera.org/XMLSchema/IPXACT/1685-2022" xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.accellera.org/XMLSchema/IPXACT/1685-2022" elementFormDefault="qualified">
<xs:include schemaLocation="busDefinition.xsd"/>
<xs:include schemaLocation="component.xsd"/>
<xs:include schemaLocation="design.xsd"/>
<xs:include schemaLocation="designConfig.xsd"/>
<xs:include schemaLocation="abstractionDefinition.xsd"/>
<xs:include schemaLocation="catalog.xsd"/>
<xs:include schemaLocation="abstractor.xsd"/>
<xs:include schemaLocation="typeDefinitions.xsd"/>
<!-- <xs:include schemaLocation="memoryMapDefinition.xsd"/> -->
<!-- <xs:include schemaLocation="addressBlockDefinition.xsd"/> -->
<!-- <xs:include schemaLocation="registerFileDefinition.xsd"/> -->
<!-- <xs:include schemaLocation="registerDefinition.xsd"/> -->
<!-- <xs:include schemaLocation="fieldDefinition.xsd"/> -->
<!-- <xs:include schemaLocation="enumerationDefinition.xsd"/> -->
<xs:group name="IPXACTDocumentTypes">
It seems xmlschema is also picking up the locations in the commented-out section, which point to invalid schema definitions anyway.
Let me know if additional context is needed.
Regarding the two different ways to get the schemas: both result in identical schemas being downloaded for my use case.
OK, it is probably better to abandon the regex for extracting the schemaLocation list from the text and iterate over the ElementTree structure instead. This is planned for the next bugfix release.
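The difference between the two approaches can be sketched with the standard library alone. This is only an illustration of the idea, not the library's actual code: the function names, the regex, and the shortened sample schema are assumptions made for the example. The default ElementTree parser discards comments, so commented-out includes never appear in the tree.

```python
import re
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
# XSD elements that can carry a schemaLocation attribute.
LOCATION_TAGS = {
    f"{{{XSD_NS}}}{name}" for name in ("include", "import", "redefine", "override")
}

# Shortened, hypothetical excerpt modelled on the index.xsd shown above.
SOURCE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="component.xsd"/>
  <!-- <xs:include schemaLocation="fieldDefinition.xsd"/> -->
</xs:schema>"""


def locations_by_regex(text: str) -> list:
    # Text-based extraction: it cannot tell comments from real elements.
    return re.findall(r'\bschemaLocation\s*=\s*[\'"]([^\'"]*)[\'"]', text)


def locations_by_tree(text: str) -> list:
    # Tree-based extraction: comments are dropped by the default parser.
    root = ET.fromstring(text)
    return [
        elem.get("schemaLocation")
        for elem in root.iter()
        if elem.tag in LOCATION_TAGS and elem.get("schemaLocation")
    ]


print(locations_by_regex(SOURCE))
print(locations_by_tree(SOURCE))
```

With this input, the regex version also returns the commented-out fieldDefinition.xsd, while the tree version returns only component.xsd.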
thank you
Another difference between the two approaches, which I missed yesterday, is that xml.xsd is missing when using the save_remote parameter on the export API.
Changing that for export is not advisable because xml.xsd is already included in the base schema set, so the xmlschema library doesn't need to save another copy of xml.xsd. If you want, you can try an export after creating the schema with use_meta=False.
Anyway, the download_schemas() API will download all the referenced XSD resources.
Sounds good, let me know if you want me to close this issue.
The changes are now published. Try the updated code and report any further problems, or close the issue. Thank you.
Sorry for the delay. Closing the issue.