to_pandas mixing up fields when optional series-level attributes are present
When the package doesn't know the structure of a dataset, it puts series-level attributes in the series key. This is a problem when those attributes are optional, because series keys of different lengths can then end up in the same data frame. Ideally the package would check the IDs in the series key to ensure fields are placed in the right location. Please see the example below, where SCALE is an optional attribute: the first series does not have SCALE, so its TIME_PERIOD values end up in the SCALE column. Note that I have not checked whether this issue also happens when the structure is known.
import sdmx
IMF_DATA = sdmx.Client('IMF_DATA')
key = 'NAM.A.P_TOTINV_P_USD.S1.S1.BEL.A+S'
data_msg = IMF_DATA.data('PIP', key=key)
pip_df = sdmx.to_pandas(data_msg)
print(pip_df.head())
which outputs
INDICATOR       ACCOUNTING_ENTRY  COUNTRY  SECTOR  COUNTERPART_SECTOR  COUNTERPART_COUNTRY  FREQUENCY  SCALE  TIME_PERIOD
P_TOTINV_P_USD  A                 NAM      S1      S1                  BEL                  A          2020   NaN          NaN
                                                                                                       2021   NaN          NaN
                                                                                                       2022   NaN          NaN
                                                                                                       2023   NaN          NaN
                                                                                            S          6      2020-S2      NaN
Name: value, dtype: float64
Here is the relevant API call: https://api.imf.org/external/sdmx/2.1/data/IMF.STA,PIP/NAM.A.P_TOTINV_P_USD.S1.S1.BEL.A+S
Thanks for the concise example.
As you note (and as with one of the two cases you reported in #240), the issue is that the package is forced to guess the underlying data structure when the user does not provide it. The SDMX-ML reader here calls BaseDataStructureDefinition.make_key(…, extend=True):
https://github.com/khaeru/sdmx/blob/59a90ecf01955768d13ae7351cd61e41c5551f2c/sdmx/reader/xml/v21.py#L1109-L1112
(extend=True because "SS without structure" is the flag for this situation) …which signals the function here to create new Dimensions as needed to create the key: https://github.com/khaeru/sdmx/blob/59a90ecf01955768d13ae7351cd61e41c5551f2c/sdmx/model/common.py#L1408-L1411
Without the structure info, the code can't know in advance which new identifiers are for dimensions that are part of the series key vs. series-attached attributes.
For example, seeing series keys in this order:
<Series COUNTRY="NAM" … FREQUENCY="A">
<Series COUNTRY="NAM" … FREQUENCY="S" SCALE="6">
…the logic could be "All IDs (COUNTRY … FREQUENCY) appearing in the first seen key are Dimensions; any that appear later (SCALE) are Attributes."
But then if the data happened to instead have the opposite order:
<Series COUNTRY="NAM" … FREQUENCY="S" SCALE="6">
<Series COUNTRY="NAM" … FREQUENCY="A">
…it would expect SCALE to be present in the second SeriesKey, and would have to rewind and revise the DSD and all previously-seen SeriesKeys to change SCALE from a KeyValue to an AttributeValue.
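To make the order dependence concrete, here is a minimal sketch of that "first seen key" rule. It is not the package's code: `classify_first_seen` is a hypothetical helper, each series is reduced to a plain dict of its XML attributes, and the IDs elided by "…" above are simply left out.

```python
def classify_first_seen(series_list):
    """Hypothetical rule: IDs in the first series are dimensions;
    any ID that first appears later is treated as an attribute."""
    dims = list(series_list[0])
    attrs = []
    for series in series_list[1:]:
        for id_ in series:
            if id_ not in dims and id_ not in attrs:
                attrs.append(id_)
    return dims, attrs


annual = {"COUNTRY": "NAM", "FREQUENCY": "A"}
semi = {"COUNTRY": "NAM", "FREQUENCY": "S", "SCALE": "6"}

print(classify_first_seen([annual, semi]))
# (['COUNTRY', 'FREQUENCY'], ['SCALE'])
print(classify_first_seen([semi, annual]))
# (['COUNTRY', 'FREQUENCY', 'SCALE'], [])
```

The same two series give two different structures depending only on which one arrives first, which is why the second ordering would need the rewind-and-revise step.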
Another option would be to defer this logic/determination to the end of the DataSet, when the SeriesKeys are collected and attached. I'm not sure which would be more performant for big data sets.
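A rough sketch of the deferred variant, again on plain dicts rather than the package's SeriesKey objects. `classify_deferred` is a hypothetical helper, and the rule it uses (an ID present on every series is a dimension, anything else is an attribute) is only an assumption: an attribute that happens to appear on every series would still be misclassified.

```python
def classify_deferred(series_list):
    """Hypothetical rule applied after all series are collected:
    IDs present on every series are dimensions, the rest attributes."""
    common = set(series_list[0])
    for series in series_list[1:]:
        common &= set(series)

    dims, attrs = [], []
    for series in series_list:
        for id_ in series:
            target = dims if id_ in common else attrs
            if id_ not in target:
                target.append(id_)
    return dims, attrs


annual = {"COUNTRY": "NAM", "FREQUENCY": "A"}
semi = {"COUNTRY": "NAM", "FREQUENCY": "S", "SCALE": "6"}

print(classify_deferred([annual, semi]))
# (['COUNTRY', 'FREQUENCY'], ['SCALE'])
print(classify_deferred([semi, annual]))
# (['COUNTRY', 'FREQUENCY'], ['SCALE'])
```

The result no longer depends on the order, at the cost of holding every series key until the end of the DataSet and making a second pass over them.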
Some possible resolutions:
1. Give warnings when SeriesKeys have mismatched lengths, as this may indicate this kind of situation.
2. Adjust the .reader.xml code as described above.
3. Adjust the .writer.pandas code to produce NaNs in the correct index dimensions.
(2) could be a lot of complexity to introduce just to cover for non-standard usage. On the other hand (3) forces the .writer.pandas code to handle structures that may not even be valid per the SDMX Information Model (i.e. I don't know if the IM requires all SeriesKeys within a DataSet to have the same number of associated Dimensions).
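As a rough illustration of the warning and the NaN-filling ideas together, the sketch below works on plain dicts, not on the package's objects. `align_series_keys` is a hypothetical helper, the sample entries are made up from the example in this issue, and `None` stands in for the NaN the writer would produce.

```python
import warnings


def align_series_keys(series_keys):
    """Place each value under its own ID instead of relying on position.

    Returns (columns, rows); IDs missing from a key become None.
    """
    columns = []
    for key in series_keys:
        for id_ in key:
            if id_ not in columns:
                columns.append(id_)

    # Mismatched lengths hint that some IDs are optional attributes
    if len({len(key) for key in series_keys}) > 1:
        warnings.warn(
            "Series keys have differing lengths; some IDs may be "
            "optional series-level attributes",
            UserWarning,
        )

    rows = [[key.get(id_) for id_ in columns] for key in series_keys]
    return columns, rows


keys = [
    {"COUNTRY": "NAM", "FREQUENCY": "A", "TIME_PERIOD": "2020"},
    {"COUNTRY": "NAM", "FREQUENCY": "S", "SCALE": "6", "TIME_PERIOD": "2020-S2"},
]
columns, rows = align_series_keys(keys)
print(columns)
# ['COUNTRY', 'FREQUENCY', 'TIME_PERIOD', 'SCALE']
print(rows)
# [['NAM', 'A', '2020', None], ['NAM', 'S', '2020-S2', '6']]
```

Matching by ID keeps TIME_PERIOD out of the SCALE column, though it does nothing to settle whether SCALE belongs in the index at all.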
I'll have to think a bit more about what is the best resolution.
In SDMX 3.X, the response is not limited to a single DSD. For example, the REST documentation includes the query "Retrieve the list of indicators about Switzerland (CH) available in the source":
https://ws-entry-point/data/?c[REF_AREA]=CH&attributes=none&measures=none
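For reference, a small sketch of building that query string with the standard library. `build_query` is a hypothetical helper, and `ws-entry-point` is the placeholder host from the example, not a real server.

```python
from urllib.parse import urlencode

# Placeholder entry point copied from the example above; not a real host.
ENTRY_POINT = "https://ws-entry-point/data/"


def build_query(component_filters, **params):
    """Build a data query URL with c[ID]=value component filters."""
    query = {f"c[{id_}]": value for id_, value in component_filters.items()}
    query.update(params)
    # safe="[]" leaves the square brackets as-is instead of percent-encoding them
    return ENTRY_POINT + "?" + urlencode(query, safe="[]")


url = build_query({"REF_AREA": "CH"}, attributes="none", measures="none")
print(url)
# https://ws-entry-point/data/?c[REF_AREA]=CH&attributes=none&measures=none
```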
There is also the SDMX 3.1 feature of evolving structures, so two dataflows based on the same DSD can have different numbers of dimensions. If you query both dataflows in the same call, I would imagine the response would also contain differing numbers of dimensions.
We currently don't support either of these features, but I believe FMR does, and .STAT will likely support evolving structures soon (it was created to support the UN's SDGs).