xmlschema XML Schema Download incorrectly modifies the schema

I am trying to download an XML schema from a remote URL, and the library seems to modify one of the schema documents incorrectly.

Here's a snippet of the code to reproduce the issue:

import xmlschema
import os.path
import urllib.parse

def main():
    # Schema Base URL resource
    xsd_base_uri = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd"

    # Extract Path from URI
    path = urllib.parse.urlparse(xsd_base_uri).path
    print(path)

    # Split path into path + resource name
    schema_path = os.path.split(path)
    print(schema_path)

    target_path = f"schemas/{schema_path[0]}"

    # Create the directory if it doesn't exist:
    os.makedirs(target_path, exist_ok=True)

    local_target_path = f"{target_path}/{schema_path[1]}"

    if not os.path.isfile(f"{local_target_path}"):
        schema = xmlschema.XMLSchema(xsd_base_uri)
        schema.export(target=target_path, save_remote=True)
    schema  = xmlschema.XMLSchema(local_target_path)

if __name__ == '__main__':
    main()

The library seems to be modifying the xs:import line in the autoConfigure.xsd document:

The left side is the original file downloaded from the URL: http://www.accellera.org/XMLSchema/IPXACT/1685-2022/autoConfigure.xsd

Because of this edit, loading the exported schema fails, since the modified XML Schema is no longer valid, with the following error:

xmlschema.validators.exceptions.XMLSchemaParseError: the QName 'xml:id' is mapped to the namespace 'http://www.w3.org/XML/1998/namespace', but this namespace has not an xs:import statement in the schema:

Schema component:

  <xs:attribute xmlns:xs="http://www.w3.org/2001/XMLSchema" ref="xml:id" />

Path: /xs:schema/xs:attributeGroup[2]/xs:attribute

There are other changes as well, but for now they have no significant impact. Is there an option to download the XML schemas without editing them?

Feb 22 '24 04:02 AmeyaVS

Hi, thank you for the detailed explanation.

An alternative is the uri_mapper option, available since release v3.0.0: download the schemas manually and then provide a map from remote URLs to local paths.
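A rough sketch of the uri_mapper approach, assuming the remote files have already been saved by hand under a local directory. The helper names build_uri_map and load_with_local_copies, and the paths used with them, are hypothetical; only the uri_mapper argument itself comes from the comment above, and whether it accepts plain local paths (rather than file URLs) as mapped values should be checked against the library documentation.

```python
import os.path
import urllib.parse


def build_uri_map(remote_urls, local_dir):
    """Map each remote URL to a same-named file inside local_dir.

    Hypothetical helper: it only builds the dictionary and assumes the
    files were downloaded manually beforehand.
    """
    uri_map = {}
    for url in remote_urls:
        filename = os.path.basename(urllib.parse.urlparse(url).path)
        uri_map[url] = os.path.join(local_dir, filename)
    return uri_map


def load_with_local_copies(source, uri_map):
    """Build a schema, redirecting the mapped remote URLs to local files."""
    # Imported here so the mapping helper can be tried without the library.
    import xmlschema

    return xmlschema.XMLSchema(source, uri_mapper=uri_map)
```

For the case in this thread, the map could hold a single entry for "http://www.w3.org/2001/xml.xsd" pointing at a manually saved copy, with the local index.xsd passed as the source.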

Feb 22 '24 09:02 brunato

For now, I have manually reverted the changes in the XML schema that affect me so that I can move ahead. Would it be possible to report (in logs, etc.) which files are being modified from their original sources? I spent nearly half a day before realizing the underlying issue.

Feb 22 '24 09:02 AmeyaVS

I could add a logger for the export method (the export_schema function, in fact), providing an optional loglevel argument as is already done for schema initialization/building.

Feb 22 '24 11:02 brunato

Maybe a fix in this helper is sufficient to solve this:

def replace_location(text: str, location: str, repl_location: str) -> str:
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*[\'\"].*%s.*[\'"]' % re.escape(location)
    return re.sub(pattern, repl, text)

The replacement pattern also matches the namespace part, so the XML namespace ends up with no xs:import element in your case.
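A small sketch of how the greedy pattern can swallow the namespace attribute. The first function repeats the helper quoted above; the second is a hypothetical bounded variant for comparison and is not the fix that was actually released. The sample xs:import line, with schemaLocation placed before namespace, is an assumption chosen to trigger the behaviour and is not copied from autoConfigure.xsd.

```python
import re


def replace_location_greedy(text, location, repl_location):
    # Same pattern as the helper quoted above: the trailing ".*" runs
    # up to the last quote on the line.
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*[\'\"].*%s.*[\'"]' % re.escape(location)
    return re.sub(pattern, repl, text)


def replace_location_bounded(text, location, repl_location):
    # Hypothetical variant: the match cannot cross a quote character,
    # so it stays inside the schemaLocation attribute value.
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*([\'"])[^\'"]*%s[^\'"]*\1' % re.escape(location)
    return re.sub(pattern, repl, text)


LOCATION = "http://www.w3.org/2001/xml.xsd"

# Assumed input line, for illustration only.
IMPORT_LINE = (
    '<xs:import schemaLocation="http://www.w3.org/2001/xml.xsd" '
    'namespace="http://www.w3.org/XML/1998/namespace"/>'
)

if __name__ == "__main__":
    print(replace_location_greedy(IMPORT_LINE, LOCATION, "xml.xsd"))
    print(replace_location_bounded(IMPORT_LINE, LOCATION, "xml.xsd"))
```

With this input the greedy version drops the namespace attribute, while the bounded version rewrites only the schemaLocation value.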

Another improvement (reducing useless changes) could be to skip the erasing of residual non-remote locations.

Mar 11 '24 17:03 brunato

The new release v3.1.0 has a fix for schema exports. The replacement pattern has been replaced with a safer one (assuming that the source is a valid XML document ...) and the residual imports are cleared only if schemaLocation contains a remote URL.

Also, a logging facility has been added to the export_schema() function (it can be activated by providing logging='DEBUG' to XMLSchema.export()).

Mar 13 '24 07:03 brunato

I tried out the latest release, and it no longer seems to modify the XML schema. However, it does not seem to download the xml.xsd dependency referenced in the import statement. Should I create another bug report for that?

Mar 13 '24 09:03 AmeyaVS

The XML namespace is already loaded within the meta-schema, so an xs:import element has to be present in the schema if the namespace is used (e.g. xml:base), but the import has no effect (and in many cases I found that the location points to an HTML page instead of a regular XSD file).

So downloading the remote xml.xsd (if any) is not necessary for xmlschema; the only problem is the removal of the namespace attribute from the xs:import statement.

Mar 13 '24 09:03 brunato

To clarify: the schema export doesn't download anything; it only uses the already downloaded XSD sources contained in the schema instance and saves them locally.

Mar 13 '24 10:03 brunato

(and in many cases i found that the location points to an HTML page instead of a regular XSD file).

I'm sorry, I didn't remember correctly: the referenced xml.xsd (e.g. "http://www.w3.org/2001/xml.xsd") is an XSD file with a stylesheet.

Schema classes use a meta-schema that has already loaded a minimal set of base namespaces:

XML namespace: 'http://www.w3.org/XML/1998/namespace'
XSI namespace: 'http://www.w3.org/2001/XMLSchema-instance'
XSD namespace: 'http://www.w3.org/2001/XMLSchema'
VC namespace: 'http://www.w3.org/2007/XMLSchema-versioning' (XMLSchema11 only)

I cannot remove XML from the base namespaces because xml:lang is used in the XSD namespace meta-schema (with a regular import). The meta-schema plays a fundamental part in efficient validation and decoding, although it can be rebuilt if needed.

Anyway, I think the export procedure can be extended with another option that attempts to load and save the residual locations referenced by skipped xs:import elements. I will try this approach for a future release.

Mar 14 '24 17:03 brunato

FYI about the special status of the above four base namespaces: https://www.w3.org/TR/xmlschema11-1/#sec-nss-special

Mar 16 '24 10:03 brunato

@AmeyaVS: I will not change schema export to download skipped schemas, as in the case of xml.xsd, but in the next minor release I will add a new API, download_schemas(), for downloading a set of schemas given a URL as the starting point.

Mar 18 '24 10:03 brunato

Makes sense. Thank you for getting a fix in quickly.

Mar 18 '24 12:03 AmeyaVS

Should I close this issue in the meantime, or should I keep it open until the download_schemas() API is ready? I could update here with details or any other observations.

Mar 22 '24 07:03 AmeyaVS

Keep it open, the next minor release should be ready soon.

Mar 22 '24 08:03 brunato

The download_schemas() API is available with release v3.2.0.

Apr 02 '24 12:04 brunato

Hello @brunato ,

I tried the following code to exercise the download_schemas API.

import xmlschema
import os.path
import urllib.parse

from xmlschema import download_schemas

def main():
    # Schema Base URL resource
    xsd_base_uri = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd"

    # Extract Path from URI
    path = urllib.parse.urlparse(xsd_base_uri).path
    print(path)

    # Split path into path + resource name
    schema_path = os.path.split(path)
    print(schema_path)

    target_path = f"schemas/{schema_path[0]}"

    # Create the directory if it doesn't exist:
    os.makedirs(target_path, exist_ok=True)

    local_target_path = f"{target_path}/{schema_path[1]}"

    # Download schemas
    download_schemas(xsd_base_uri, target="schemas2")

    if not os.path.isfile(f"{local_target_path}"):
        schema = xmlschema.XMLSchema(xsd_base_uri)
        schema.export(target=target_path, save_remote=True)
    schema  = xmlschema.XMLSchema(local_target_path)

if __name__ == '__main__':
    main()

I am observing the following error messages on the console for these XSD URLs:

Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/addressBlockDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/registerFileDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/registerDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/memoryMapDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/enumerationDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/fieldDefinition.xsd: not well-formed (invalid token): line 15, column 46

Looking at the index.xsd source file:

<xs:schema xmlns:ipxact="http://www.accellera.org/XMLSchema/IPXACT/1685-2022" xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.accellera.org/XMLSchema/IPXACT/1685-2022" elementFormDefault="qualified">
	<xs:include schemaLocation="busDefinition.xsd"/>
	<xs:include schemaLocation="component.xsd"/>
	<xs:include schemaLocation="design.xsd"/>
	<xs:include schemaLocation="designConfig.xsd"/>
	<xs:include schemaLocation="abstractionDefinition.xsd"/>
	<xs:include schemaLocation="catalog.xsd"/>
	<xs:include schemaLocation="abstractor.xsd"/>
	<xs:include schemaLocation="typeDefinitions.xsd"/>
	<!-- <xs:include schemaLocation="memoryMapDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="addressBlockDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="registerFileDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="registerDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="fieldDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="enumerationDefinition.xsd"/> -->
	<xs:group name="IPXACTDocumentTypes">

It seems xmlschema is also processing the commented-out includes, which point to invalid schema definitions anyway.

Let me know if additional context is needed.

As for the two different ways to get the schemas, both result in identical schemas being downloaded for my use case.

Apr 02 '24 13:04 AmeyaVS

OK, it may be better to abandon regex for extracting the schemaLocation list from the text and to iterate over the ElementTree structure instead. This will go into the next bugfix release.

Thank you.
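A minimal sketch of the difference between the two extraction strategies, using only the standard library. The regex below is a hypothetical stand-in and not the pattern used inside xmlschema, and the schema text is a shortened version of the index.xsd excerpt above; the function names are likewise made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

XSD_NAMESPACE = "http://www.w3.org/2001/XMLSchema"

# Shortened sample based on the index.xsd excerpt in this thread.
SCHEMA_TEXT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:include schemaLocation="busDefinition.xsd"/>
    <xs:include schemaLocation="component.xsd"/>
    <!-- <xs:include schemaLocation="memoryMapDefinition.xsd"/> -->
</xs:schema>"""


def locations_by_regex(text):
    # Text-level search: it cannot tell that a match sits inside a comment.
    return re.findall(r'\bschemaLocation\s*=\s*["\']([^"\']*)["\']', text)


def locations_by_tree(text):
    # Structure-level search: the default parser discards comments, so only
    # real xs:include / xs:import / xs:redefine / xs:override children count.
    names = ("include", "import", "redefine", "override")
    tags = {"{%s}%s" % (XSD_NAMESPACE, name) for name in names}
    root = ET.fromstring(text)
    return [
        child.get("schemaLocation")
        for child in root
        if child.tag in tags and child.get("schemaLocation")
    ]


if __name__ == "__main__":
    print(locations_by_regex(SCHEMA_TEXT))
    print(locations_by_tree(SCHEMA_TEXT))
```

The regex version also returns the commented-out memoryMapDefinition.xsd, which is the kind of location that produced the parse errors reported above; the tree version returns only the two active includes.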

Apr 02 '24 14:04 brunato

Another observation that I missed yesterday: the difference between the two approaches is that xml.xsd is missing when using the save_remote parameter of the export API.

Apr 03 '24 11:04 AmeyaVS

@AmeyaVS: I will not change schema export for downloading skipped schemas like the case of xml.xsd, but in the next minor release I will add a new API download_schemas() for download a set of schemas giving an URL as the starting point.

Changing that for export is not advisable because xml.xsd is already included in the base schema set, so the xmlschema library doesn't need to save another copy of xml.xsd. If you want, you can try an export after creating the schema with use_meta=False.

Anyway, the download_schemas() API will download all the referenced XSD resources.

Apr 04 '24 07:04 brunato

Sounds good, let me know if you want me to close this issue.

Apr 05 '24 15:04 AmeyaVS

The changes are now published. Try the updated code and report any other problems, or close the issue. Thank you.

Apr 17 '24 20:04 brunato

Sorry for the delay. Closing the issue.

Apr 30 '24 12:04 AmeyaVS
