SDMX 3.0 structures seem to not be parsed correctly
For IMF:
IMF_DATA = sdmx.Client('IMF_DATA3')
data_msg = IMF_DATA.data('CPI',agency_id='IMF.STA', context='dataflow', key='USA.CPI._T.IX.A')
breaks when trying to parse the structure in the header. It seems that only StructureUsage is properly supported. To check whether this was a mistake on our side, I tried to ingest BIS 3.0 SDMX data:
>>> response = client.get(url='https://stats.bis.org/api/v2/data/dataflow/BIS/WS_LONG_CPI/1.0/A.DZ.628')
>>> response.structure
<DataStructureDefinition BIS:WS_LONG_CPI(1.0)>
This incorrectly identifies the dataflow as a data structure definition. The actual DSD has ID BIS_LONG_CPI.
Thanks for this report. Although I can reconstruct them, it would help to have:
- The exact URLs used in the queries.
- Specific Python error messages.
- Snippets or contents of the offending files.
For your first example, I see:
>>> req = IMF_DATA.data('CPI',agency_id='IMF.STA', context='dataflow', key='USA.CPI._T.IX.A', dry_run=True)
>>> req.url
https://api.imf.org/external/sdmx/3.0/data/dataflow/IMF.STA/CPI/+/USA.CPI._T.IX.A
Without dry_run=True, I see this XMLParseError:
KeyError Traceback (most recent call last)
File ~/vc/sdmx/sdmx/reader/xml/common.py:202, in XMLEventReader.convert(self, data, structure, _events, **kwargs)
200 continue # Explicitly no parser for this (element, event) → skip
--> 202 result = func(self, element) # Parse the element
203 self.push(result) # Store the result
File ~/vc/sdmx/sdmx/reader/xml/v21.py:268, in _header_structure(reader, elem)
267 # Store under the structure ID, so it can be looked up by that ID
--> 268 reader.push(elem.attrib["structureID"], structure)
270 # Store as objects that won't cause a parsing error if it is left over
File src/lxml/etree.pyx:2548, in lxml.etree._Attrib.__getitem__()
KeyError: 'structureID'
The above exception was the direct cause of the following exception:
[…]
XMLParseError: KeyError: 'structureID'
…so I assume this is what you mean by "breaks".
Accessing the URL, the relevant snippet is:
<message:StructureSpecificData xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message ../../schemas/SDMXMessage.xsd">
<message:Header>
<message:Structure namespace="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=IMF.STA:DSD_CPI(5.0.0)" dimensionAtObservation="TIME_PERIOD">
<common:Structure>
urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=IMF.STA:DSD_CPI(5.0.0)
</common:Structure>
</message:Structure>
</message:Header>
For your second example:
<message:StructureSpecificData xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message https://registry.sdmx.org/schemas/v2_1/SDMXMessage.xsd urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0):ObsLevelDim:TIME_PERIOD https://stats.bis.org/ws/public/sdmxapi/rest/schema/dataflow/BIS/WS_LONG_CPI/1.0?format=sdmx-2.1">
<message:Header>
<message:Structure structureID="BIS_WS_LONG_CPI_1_0" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0):ObsLevelDim:TIME_PERIOD" dimensionAtObservation="TIME_PERIOD">
<common:StructureUsage>
<Ref agencyID="BIS" id="WS_LONG_CPI" version="1.0"/>
</common:StructureUsage>
</message:Structure>
</message:Header>
So these seem to be two separate issues:
- In the first example, SDMX-ML 3.0, the reader.push(elem.attrib["structureID"], structure) line should be allowed to fail, or be skipped, if that attribute is not present. I will need to recheck the schemas to see whether it is optional in SDMX-ML 3.0, or removed entirely.
- In the second example, SDMX-ML 2.1, this is already a sub-optimal situation. The user is trying to retrieve/parse structure-specific SDMX-ML without giving the actual structure. sdmx.reader.xml is thus forced to infer everything about the structure.¹ Currently it guesses that the given <Ref …> is to a DataStructureDefinition. That guess could be more sophisticated, e.g. also checking the <mes:Structure namespace="..."> attribute to identify that here the <Ref …> is to a DataflowDefinition. But the code can never guess the ID/URN of the (not provided) DSD that the (not provided) DFD points to; it could be literally anything. In other words, when you say "The actual DSD has ID BIS_LONG_CPI", that isn't information available from within the message itself.
¹ As an aside, I have considered a few times the idea to automatically make a sub-query to fetch missing structure(s) when structure-specific data is to be parsed. This would add a lot of complexity, and it's not a top priority from my POV.
To fix the first issue seems straightforward. I'd welcome any thoughts on what to do for the second.
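To make the two points above concrete, here is a minimal sketch. It is not the sdmx.reader.xml code. It uses the standard library's ElementTree on simplified fragments (namespace prefixes dropped), and the names URN_PATTERN and describe_header_structure are hypothetical. It shows (a) reading structureID without raising KeyError when the attribute is absent, and (b) using the namespace URN to tell whether the header refers to a Dataflow or a DataStructure.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper pattern: matches the leading part of an SDMX URN such as
# urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0)
# and ignores any suffix like ":ObsLevelDim:TIME_PERIOD".
URN_PATTERN = re.compile(
    r"urn:sdmx:org\.sdmx\.infomodel\.(?P<package>\w+)\.(?P<cls>\w+)="
    r"(?P<agency>[^:]+):(?P<id>[^(]+)\((?P<version>[^)]+)\)"
)


def describe_header_structure(elem):
    """Summarise a header Structure element without assuming structureID exists."""
    # .get() returns None instead of raising KeyError for a missing attribute
    structure_id = elem.attrib.get("structureID")

    match = URN_PATTERN.match(elem.attrib.get("namespace", ""))
    return {
        "structure_id": structure_id,
        # Class named in the URN, e.g. "Dataflow" or "DataStructure"
        "ref_class": match.group("cls") if match else None,
        "agency": match.group("agency") if match else None,
        "id": match.group("id") if match else None,
        "version": match.group("version") if match else None,
    }


# Simplified versions of the two header snippets quoted in this thread
IMF_HEADER = (
    '<Structure namespace="urn:sdmx:org.sdmx.infomodel.datastructure.'
    'DataStructure=IMF.STA:DSD_CPI(5.0.0)" dimensionAtObservation="TIME_PERIOD"/>'
)
BIS_HEADER = (
    '<Structure structureID="BIS_WS_LONG_CPI_1_0" '
    'namespace="urn:sdmx:org.sdmx.infomodel.datastructure.'
    'Dataflow=BIS:WS_LONG_CPI(1.0):ObsLevelDim:TIME_PERIOD" '
    'dimensionAtObservation="TIME_PERIOD"/>'
)

for fragment in (IMF_HEADER, BIS_HEADER):
    print(describe_header_structure(ET.fromstring(fragment)))
```

Even with this, the BIS header only identifies the dataflow; the ID of the DSD behind it still cannot be derived from the message.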
For the second issue, currently we have:
>>> response.structure
<DataStructureDefinition BIS:WS_LONG_CPI(1.0)>
>>> response.dataflow
<DataflowDefinition (missing id)>
I think giving
>>> response.structure
<DataStructureDefinition (missing id)>
>>> response.dataflow
<Dataflow BIS:WS_LONG_CPI(1.0)>
would be better, but maybe it's hard because 3.0 and 2.1 differ in type for .dataflow. Having both return missing would also be OK.
Fetching missing structure would be helpful, RSDMX fetches all the supporting structures off a data query based on an optional parameter, which they use to populate labels instead of codes in their dataframe.
I agree the second would be more consistent with the contents of the message per se. The example you've given provides a good test specimen to check the behaviour, so I can work from that.
Fetching missing structure would be helpful, RSDMX fetches all the supporting structures off a data query based on an optional parameter
Yeah, the reason I say "I've considered a few times" is it's a feature that would certainly be helpful. It's only because of the complexity (+ lack of bandwidth) that I've shied away from it for now. (For instance: if the user is handling offline SDMX-ML messages from file, when should the code attempt to make an online query for related structures? Should it expect to find the structures in a file? Should it cache to avoid repeated sub-queries, e.g. if the user is too lazy to store the structures locally? And then all of these behaviours would need tests.)
In any case, I can record this as a separate wishlist item; then you can +1 it, and we can see if anyone wants to tackle it themselves or support its development.
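For whoever picks up that wishlist item, the open questions above (local files vs. online sub-queries, caching) could be pinned down with something like the following rough sketch. Everything here is hypothetical and not part of the sdmx package: the StructureResolver class, its allow_online flag, and the fetch callable are assumptions made only to illustrate one possible policy.

```python
from typing import Callable, Dict, Optional


class StructureResolver:
    """Hypothetical policy object: look locally first, query online only if allowed."""

    def __init__(
        self,
        fetch: Optional[Callable[[str], object]] = None,
        allow_online: bool = False,
    ) -> None:
        self._cache: Dict[str, object] = {}
        self._fetch = fetch  # assumed callable: URN -> structure object
        self._allow_online = allow_online
        self.fetch_count = 0  # number of online sub-queries actually made

    def add_local(self, urn: str, structure: object) -> None:
        """Register a structure that was already loaded, e.g. from a file."""
        self._cache[urn] = structure

    def resolve(self, urn: str) -> Optional[object]:
        """Return the structure for `urn`, or None if it cannot be obtained."""
        if urn in self._cache:
            return self._cache[urn]
        if not self._allow_online or self._fetch is None:
            # Offline use: never trigger a network request implicitly
            return None
        structure = self._fetch(urn)
        self.fetch_count += 1
        self._cache[urn] = structure  # avoid repeated sub-queries
        return structure


# Usage with a stand-in fetch function instead of a real web service call
def fake_fetch(urn: str) -> dict:
    return {"urn": urn}


resolver = StructureResolver(fetch=fake_fetch, allow_online=True)
urn = "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:WS_LONG_CPI(1.0)"
resolver.resolve(urn)
resolver.resolve(urn)  # second call is served from the cache
print(resolver.fetch_count)
```

Each branch in resolve() corresponds to a behaviour that would need its own test, which is the complexity mentioned above.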
Just to update: you are correct that structureID is still required in SDMX 3.0, so for the first issue I will aim to fix it on our side.