SDMX 3.0 structures seem to not be parsed correctly

Open aboddie opened this issue 4 months ago • 5 comments

For IMF:

IMF_DATA = sdmx.Client('IMF_DATA3')
data_msg = IMF_DATA.data('CPI',agency_id='IMF.STA', context='dataflow', key='USA.CPI._T.IX.A')

breaks when trying to parse the structure in the header. It seems that only StructureUsage is properly supported. To check whether the mistake was on our side, I tried ingesting BIS SDMX 3.0 data:

>>> response = client.get(url='https://stats.bis.org/api/v2/data/dataflow/BIS/WS_LONG_CPI/1.0/A.DZ.628')
>>> response.structure
<DataStructureDefinition BIS:WS_LONG_CPI(1.0)> 

This incorrectly identifies the dataflow as a data structure definition. The actual DSD has ID BIS_LONG_CPI.

aboddie avatar Aug 28 '25 20:08 aboddie

Thanks for this report. Although I can reconstruct them, it would help to have:

  • The exact URLs used in the queries.
  • Specific Python error messages.
  • Snippets or contents of the offending files.

For your first example, I see:

>>> req = IMF_DATA.data('CPI',agency_id='IMF.STA', context='dataflow', key='USA.CPI._T.IX.A', dry_run=True)
>>> req.url
https://api.imf.org/external/sdmx/3.0/data/dataflow/IMF.STA/CPI/+/USA.CPI._T.IX.A

Without dry_run=True, I see this XMLParseError:

KeyError                                  Traceback (most recent call last)
File ~/vc/sdmx/sdmx/reader/xml/common.py:202, in XMLEventReader.convert(self, data, structure, _events, **kwargs)
    200     continue  # Explicitly no parser for this (element, event) → skip
--> 202 result = func(self, element)  # Parse the element
    203 self.push(result)  # Store the result

File ~/vc/sdmx/sdmx/reader/xml/v21.py:268, in _header_structure(reader, elem)
    267 # Store under the structure ID, so it can be looked up by that ID
--> 268 reader.push(elem.attrib["structureID"], structure)
    270 # Store as objects that won't cause a parsing error if it is left over

File src/lxml/etree.pyx:2548, in lxml.etree._Attrib.__getitem__()

KeyError: 'structureID'

The above exception was the direct cause of the following exception:

[…]

XMLParseError: KeyError: 'structureID'

…so I assume this is what you mean by "breaks".

Accessing the URL, the relevant snippet is:

<message:StructureSpecificData xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message ../../schemas/SDMXMessage.xsd">
  <message:Header>
    <message:Structure namespace="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=IMF.STA:DSD_CPI(5.0.0)" dimensionAtObservation="TIME_PERIOD">
      <common:Structure>
        urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=IMF.STA:DSD_CPI(5.0.0)
      </common:Structure>
    </message:Structure>
  </message:Header>

For your second example:

<message:StructureSpecificData xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message https://registry.sdmx.org/schemas/v2_1/SDMXMessage.xsd urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0):ObsLevelDim:TIME_PERIOD https://stats.bis.org/ws/public/sdmxapi/rest/schema/dataflow/BIS/WS_LONG_CPI/1.0?format=sdmx-2.1">
  <message:Header>
    <message:Structure structureID="BIS_WS_LONG_CPI_1_0" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0):ObsLevelDim:TIME_PERIOD" dimensionAtObservation="TIME_PERIOD">
      <common:StructureUsage>
        <Ref agencyID="BIS" id="WS_LONG_CPI" version="1.0"/>
      </common:StructureUsage>
    </message:Structure>
  </message:Header>

khaeru avatar Aug 29 '25 10:08 khaeru

So these seem to be two separate issues:

  • In the first example, SDMX-ML 3.0, the reader.push(elem.attrib["structureID"], structure) line should be allowed to fail, or be skipped, if that attribute is not present. I will need to recheck the schemas whether it is optional in SDMX-ML 3.0, or removed entirely.
  • In the second example, SDMX-ML 2.1, this is already a sub-optimal situation. The user is trying to retrieve/parse structure-specific SDMX-ML without giving the actual structure. sdmx.reader.xml is thus forced to infer everything about the structure.¹ Currently it guesses that the given <Ref …> is to a DataStructureDefinition. That guess could be more sophisticated, e.g. also checking the <mes:Structure namespace="..."> attribute to identify that here the <Ref …> is to a DataflowDefinition. But the code can never guess the ID/URN of the (not provided) DSD that the (not provided) DFD points to; it could be literally anything. In other words, when you say "The actual DSD has ID BIS_LONG_CPI", that isn't information available from within the message itself.
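As a rough illustration of the "more sophisticated guess" mentioned in the second bullet, the namespace attribute can be split into its class, agency, ID, and version with a regular expression. This is a minimal sketch using only the standard library; `URN` and `guess_reference` are hypothetical names and are not part of the sdmx package.

```python
import re

# Matches the leading part of an SDMX URN such as
# urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0)
# Anything after the closing parenthesis (e.g. ":ObsLevelDim:TIME_PERIOD") is ignored.
URN = re.compile(
    r"urn:sdmx:org\.sdmx\.infomodel\.(?P<package>\w+)\.(?P<cls>\w+)="
    r"(?P<agency>[^:]+):(?P<id>[^(]+)\((?P<version>[^)]+)\)"
)


def guess_reference(namespace: str) -> dict:
    """Hypothetical helper: return the class, agency, ID, and version named by the URN."""
    match = URN.match(namespace)
    if match is None:
        raise ValueError(f"not an SDMX URN: {namespace!r}")
    return match.groupdict()


# Namespace attribute from the BIS message quoted above
bis = guess_reference(
    "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0)"
    ":ObsLevelDim:TIME_PERIOD"
)
print(bis["cls"], bis["agency"], bis["id"], bis["version"])
```

For the BIS message this yields the class `Dataflow`, which is enough to tell that the `<Ref …>` points at a dataflow. It still says nothing about the DSD that the dataflow refers to.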

¹ As an aside, I have considered a few times the idea to automatically make a sub-query to fetch missing structure(s) when structure-specific data is to be parsed. This would add a lot of complexity, and it's not a top priority from my POV.

To fix the first issue seems straightforward. I'd welcome any thoughts on what to do for the second.
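A sketch of what "allowed to fail, or be skipped" could look like for the first issue: read the attribute with `.get()` and only store the structure when an ID is present. This uses the standard library's `xml.etree.ElementTree` on simplified, un-prefixed header elements; `store_structure` and the `registry` dict are hypothetical stand-ins, not the reader's actual code.

```python
import xml.etree.ElementTree as ET

# Simplified header elements (namespace prefixes dropped for brevity).
# The first mirrors the IMF message, which has no structureID attribute.
HEADER_WITHOUT_ID = (
    '<Structure namespace="urn:sdmx:org.sdmx.infomodel.datastructure.'
    'DataStructure=IMF.STA:DSD_CPI(5.0.0)" dimensionAtObservation="TIME_PERIOD"/>'
)
HEADER_WITH_ID = (
    '<Structure structureID="BIS_WS_LONG_CPI_1_0" dimensionAtObservation="TIME_PERIOD"/>'
)


def store_structure(elem, structure, registry):
    """Hypothetical helper: store `structure` under structureID only if present."""
    key = elem.attrib.get("structureID")  # None instead of KeyError when missing
    if key is not None:
        registry[key] = structure
    return key


registry = {}
store_structure(ET.fromstring(HEADER_WITHOUT_ID), "structure A", registry)  # skipped
store_structure(ET.fromstring(HEADER_WITH_ID), "structure B", registry)  # stored
print(registry)
```

Whether skipping silently is appropriate depends on whether the schema makes the attribute optional, which is the open question noted above.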

khaeru avatar Aug 29 '25 10:08 khaeru

For the second issue, currently we have:

>>> response.structure
<DataStructureDefinition BIS:WS_LONG_CPI(1.0)>
>>> response.dataflow
<DataflowDefinition (missing id)>

I think giving

>>> response.structure
<DataStructureDefinition (missing id)>
>>> response.dataflow
<Dataflow BIS:WS_LONG_CPI(1.0)>

would be better, but maybe it's hard because 3.0 and 2.1 differ in type for .dataflow. Having both return missing would also be OK.

Fetching missing structure would be helpful, RSDMX fetches all the supporting structures off a data query based on an optional parameter, which they use to populate labels instead of codes in their dataframe.

aboddie avatar Aug 29 '25 14:08 aboddie

I agree the second would be more consistent with the contents of the message per se. The example you've given provides a good test specimen to check the behaviour, so I can work from that.

"Fetching missing structure would be helpful, RSDMX fetches all the supporting structures off a data query based on an optional parameter"

Yeah, the reason I say "I've considered a few times" is it's a feature that would certainly be helpful. It's only because of the complexity (+ lack of bandwidth) that I've shied away from it for now. (For instance: if the user is handling offline SDMX-ML messages from file, when should the code attempt to make an online query for related structures? Should it expect to find the structures in a file? Should it cache to avoid repeated sub-queries, e.g. if the user is too lazy to store the structures locally? And then all of these behaviours would need tests.)
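To make those questions concrete, here is one possible lookup order, purely as a sketch: locally supplied structures first, then a cache of earlier sub-queries, then an optional online fetch. `StructureResolver` and its `fetch` callable are hypothetical and do not reflect any existing or planned design in the sdmx package.

```python
from typing import Callable, Dict, Optional


class StructureResolver:
    """Hypothetical resolver: local structures, then cache, then optional fetch."""

    def __init__(
        self,
        local: Dict[str, object],
        fetch: Optional[Callable[[str], object]] = None,
    ):
        self.local = local  # structures the user already has, e.g. read from file
        self.fetch = fetch  # stand-in for an online sub-query; None means offline
        self.cache: Dict[str, object] = {}

    def resolve(self, urn: str) -> Optional[object]:
        if urn in self.local:
            return self.local[urn]
        if urn in self.cache:
            return self.cache[urn]
        if self.fetch is None:
            return None  # offline and not available locally
        self.cache[urn] = self.fetch(urn)  # at most one sub-query per URN
        return self.cache[urn]


queries = []


def fake_fetch(urn: str) -> str:
    """Records each call so repeated sub-queries would be visible."""
    queries.append(urn)
    return f"structure for {urn}"


resolver = StructureResolver(local={}, fetch=fake_fetch)
resolver.resolve("urn:example:A")
resolver.resolve("urn:example:A")  # served from the cache
print(len(queries))
```

Each branch in `resolve` corresponds to one of the behaviours that would need its own tests.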

In any case, I can record this as a separate, wishlist item; and then you can +1 it; and we can see if anyone wants to tackle it themselves or support its development.

khaeru avatar Aug 29 '25 14:08 khaeru

Just to update: you are correct that structureID is still required in SDMX 3.0, so I will aim to fix the first issue on our side.

aboddie avatar Sep 09 '25 14:09 aboddie