_parse_dimensions() messes up data for non-ordered single indices

Open BenediktAllendorf opened this issue 4 years ago • 0 comments

This seems like a bad bug because it messes up your data, and you might not even notice.

If a dimension is not returned in sorted order (e.g. ['1960', ..., '1969', '1950', ..., '1959']) and only one index is applicable (len(arrays) == 1), your data ends up attached to the wrong labels. This happens because the .levels member of pandas's MultiIndex holds the labels sorted, no matter what order they were originally in. The original order is kept in a second member: .codes (since 0.24.0, not sure how that was handled before).
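To make the difference between .levels and .codes concrete, here is a minimal sketch. The year labels are made up, and the single-level MultiIndex is built with from_arrays purely for illustration, which may not match how the library constructs it.

```python
import pandas as pd

# Illustrative labels that arrive out of order (not real OECD output).
years = ["1960", "1961", "1950", "1951"]
midx = pd.MultiIndex.from_arrays([years], names=["Year"])

# .levels holds the unique labels, sorted.
print(list(midx.levels[0]))     # ['1950', '1951', '1960', '1961']

# .codes holds, for each row, the position of its label in .levels.
print(midx.codes[0].tolist())   # [2, 3, 0, 1]
```

Reading .levels[0] on its own therefore drops the information that '1960' came first.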

Basically, this line breaks it: https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/io/jsdmx.py#L115:

if len(arrays) == 1 and isinstance(midx, pd.MultiIndex):
    # Fix for pandas >= 0.21
    midx = midx.levels[0]

because it takes the sorted values (e.g., alphanumerically), not the order in which they were presented.

To see this in action, try to load the HISTPOP data from OECD. If you compare it with this data: https://stats.oecd.org/Index.aspx?DataSetCode=HISTPOP, you can see that pandas-datareader shifts all the data (that is only available from 1960) to 1950, and therefore, it looks like you only have data up to 2010. (If you only request data from the API with a start date after 1959, this does not happen).
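The same shift can be reproduced without calling the OECD API. The following sketch uses invented years and values (only the 1960s have data) and attaches them once to the sorted .levels[0] index and once to the presented order; it is an illustration of the effect, not the library's actual code path.

```python
import pandas as pd

# Presented order: years with data first, then the empty 1950s (made-up values).
presented = ["1960", "1961", "1950", "1951"]
values = [60.0, 61.0, float("nan"), float("nan")]

midx = pd.MultiIndex.from_arrays([presented])

# What happens when the sorted levels are used as the index.
wrong = pd.Series(values, index=midx.levels[0])
# What the data should look like.
right = pd.Series(values, index=presented)

print(wrong["1950"])  # 60.0, the 1960 value now sits under 1950
print(right["1960"])  # 60.0
```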

Correct: image

Incorrect (value for 2018 is where value for 2008 should be, etc.): image

Apparently, this can be fixed by removing the "fix" in line 115 completely (not sure why it is there or if it is still needed) or by changing it to midx = pd.Index([midx.levels[0][x] for x in midx.codes[0]]) (but that only works for pandas >= 0.24).
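A small sketch comparing the current behaviour with the replacement suggested above, on made-up labels. The get_level_values(0) variant is an additional option that is not part of the original suggestion and has not been tried against the library itself.

```python
import pandas as pd

# Illustrative out-of-order labels, as in the HISTPOP case.
years = ["1960", "1961", "1950", "1951"]
midx = pd.MultiIndex.from_arrays([years], names=["Year"])

# Current behaviour: sorted labels, presented order is lost.
current = midx.levels[0]

# Suggested replacement: map each code back to its label (pandas >= 0.24).
fixed = pd.Index([midx.levels[0][x] for x in midx.codes[0]])

# Possible alternative (assumption, not from the suggestion above):
# get_level_values returns the labels row by row.
alternative = midx.get_level_values(0)

print(list(current))      # ['1950', '1951', '1960', '1961']
print(list(fixed))        # ['1960', '1961', '1950', '1951']
print(list(alternative))  # ['1960', '1961', '1950', '1951']
```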

BenediktAllendorf, Nov 10 '20 12:11