iris
iris copied to clipboard
Merge Problems
There appear to be a problem with the method used by Iris to determine which coordinates to use to construct new dimensions, and the shape of those dimensions, when merging. It looks like the problem stems from the fact that in order to determine if two scalar coordinates A and B can be 'separated' (can appear on separate dimensions in the new cube), Iris constructs, for each point of A, the set of points of B with which it appears on a cube in the list, and likewise for B -- and thus it ignores duplicates. It then tries to construct a shape for the new cube based on this idea of separability, which causes problems.
For example, consider four scalar cubes each with scalar coordinates A and B with points taken from the following two lists of points:
A = [1, 1, 2, 2]
B = [3, 4, 3, 4]
These four cubes can be merged into a single 2 x 2 cubes with coordinates A=[1,2], B=[3,4] on separate dimensions. To determine the new dimensions, Iris constructs the following dictionaries (called 'indexes' in the code):
IA, B = { 1: set(3, 4), 2: set(3,4) }
IB, A = { 3: set(1, 2), 4: set(1,2) }
That is, for each point in A, the set of points it appears with on a source cube, and likewise for B. It then uses the reasoning I outlined above: it compares the sets in IA, B, sees that they are all the same, and determines that A is separable from B. It does the same for IB, A. It later uses this information when determining the shape of the new cube (2,2).
Now consider six scalar cubes with coordinates based on the following points:
A = [1, 2, 1, 2, 1, 2]
B = [3, 4, 4, 4, 4, 3]
C = [5, 5, 6, 6, 7, 7]
These cubes should merge into a 3 x 2 (or 2 x 3) cube with dimension coordinates C and A (or A and C) respectively, with B as a 3 x 2 auxiliary coordinate spanning both dimensions. But Iris gets things wrong. For coordinates A and B it constructs the same index dictionaries as before (because the sets of A/B point pairs are the same), so Iris determines that A and B are separable, and tries to construct a new dimension for each of them on the result cube (I'm not clear on the logic it uses to determine the shape; in this case it tries to make it (2, 5)), which causes an exception to be raised.
There are various manifestation of the error, but I believe they all stem from the fact that Iris only looks at the set of point pairs for each pair of coordinates.
Example code:
from itertools import repeat
import iris
from iris.coords import AuxCoord
from iris.cube import Cube, CubeList
def make_scalar_coords(name, points):
return [AuxCoord(point, long_name=name) for point in points]
def make_scalar_cubes(**kwargs):
# Make a list of scalar cubes with coordinates based on the keyword arguments
all_scalar_coords = [make_scalar_coords(name, points)
for name, points in kwargs.iteritems()]
cubes = [Cube(0, aux_coords_and_dims=zip(scalar_coords, repeat(None)))
for scalar_coords in zip(*all_scalar_coords)]
return CubeList(cubes)
a = [5, 5, 6, 6, 7, 7]
b = [1, 2, 1, 2, 1, 2]
c = [3, 4, 4, 4, 4, 3]
cubes = make_scalar_cubes(a=a, b=b, c=c)
cubes.merge_cube()
Result:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
.../iris/_merge.py in merge(self, unique)
1213 stack[nd_index] = data
1214
-> 1215 merged_data = biggus.ArrayStack(stack)
1216 if all_have_data:
1217 merged_data = merged_data.masked_array()
.../biggus/_init.py in __init__(self, stack)
1731 for array in stack.flat:
1732 if not isinstance(array, Array):
-> 1733 raise ValueError('sub-array must be subclass of Array')
1734 if fill_value is not None and not fill_value_ok(array):
1735 fill_value = None
ValueError: sub-array must be subclass of Array
The cheapest fix (though I'm not necessarily advocating it) would be to capture the above error and re-raise it as a MergeError.
This came up again for me today. This time on load.
The cheapest fix (though I'm not necessarily advocating it) would be to capture the above error and re-raise it as a MergeError.
The merge codebase is somewhat beyond me, so I don't know how realistic this suggestion is: could the error be caught and then trigger a fall back approach of putting A, B and C on one anonymous dimension? It wouldn't be ideal, but it would allow the user to continue with their analysis.
I've just seen this in the wild. Ouch. Thanks for making such an easy case to reproduce Dan.
With Iris 2 and dask, the exception looks like:
Traceback (most recent call last):
File "simple_bug.py", line 27, in <module>
print(cubes.merge_cube())
File "/net/home/pelson/dev/scitools/iris/lib/iris/cube.py", line 380, in merge_cube
merged_cube, = proto_cube.merge()
File "/net/home/pelson/dev/scitools/iris/lib/iris/_merge.py", line 1236, in merge
merged_data = multidim_lazy_stack(stack)
File "/net/home/pelson/dev/scitools/iris/lib/iris/_lazy_data.py", line 159, in multidim_lazy_stack
for subarray in stack])
File "/net/home/pelson/dev/scitools/iris/lib/iris/_lazy_data.py", line 159, in <listcomp>
for subarray in stack])
File "/net/home/pelson/dev/scitools/iris/lib/iris/_lazy_data.py", line 155, in multidim_lazy_stack
result = da.stack(list(stack))
File "/scratch/pelson/conda/envs/dev-iris-2/lib/python3.6/site-packages/dask/array/core.py", line 3297, in stack
if not all(x.shape == seq[0].shape for x in seq):
File "/scratch/itpe/conda/envs/dev-iris-2/lib/python3.6/site-packages/dask/array/core.py", line 3297, in <genexpr>
if not all(x.shape == seq[0].shape for x in seq):
AttributeError: 'NoneType' object has no attribute 'shape'
Thinking about this a little.. the case that breaks down appears be solveable when the separable test maintains each of the individual values in the proposed dimension coordinate relationship i.e.
IA,B = { 1: [3, 4, 4], 2: [3, 4, 4] }
IB,A = { 3: [1, 2], 4: [1, 1, 2, 2] }
Clearly this fails, and quit right too! This is what we want BTW, as Iris gets this wrong currently, and that's the crux of this particular issue.
However,
IA,C = { 1: [5, 6, 7], 2: [5, 6, 7] }
IC,A = { 5: [1, 2], 6: [1, 2], 7: [1, 2] }
most certainly works (that's very good news indeed). It even gives the correct shape, (3, 2) or (2, 3). And for completeness,
IB,C = { 3: [5, 7], 4: [5, 6, 7] }
IC,B = { 5: [3, 4], 6: [4, 4], 7: [3, 4] }
also fails, which is what we want. Interesting, huh?
There may be some side-effect in this subtle change of approach, but it seems to have enough potential to warrant me applying this to merge to see how she flies...
And if it does, I'll target the 2.2.x branch, as technically this would be a bugfix.
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.
If this issue is still important to you, then please comment on this issue and the stale label will be removed.
Otherwise this issue will be automatically closed in 28 days time.
This one is difficult to work around, so I would very much like to see a fix for it.
@SciTools/peloton 👁️
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.
If this issue is still important to you, then please comment on this issue and the stale label will be removed.
Otherwise this issue will be automatically closed in 28 days time.