Parallelising cube concatenation
In ESMValTool, we have some recipes that require concatenating a long list of cubes.
This is done with CubeList.concatenate, which, as far as I understand, loops over the cubes, identifies cubes with compatible signatures, and checks whether the coordinates of matching cubes are equal (in proto_cube.register):
https://github.com/SciTools/iris/blob/f8a45bed68982eb741c493aac49b99f56a208dad/lib/iris/_concatenate.py#L335-L353
As in #5743, we observe a significant slowdown when we try to act on a list of cubes with lazy coordinates. These coordinates need to be computed (read from disk) in order to compare them with those of the other cubes, and this is currently carried out sequentially. We could of course skip the auxiliary/derived coordinate comparison (e.g. check_aux_coords=False), but we would like to keep all checks to make sure the concatenation is robust.
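For reference, a minimal sketch of that escape hatch (`my_cubes` is a placeholder for the list of cubes to be concatenated):

```python
from iris.cube import CubeList

cubes = CubeList(my_cubes)  # my_cubes: placeholder list of iris Cubes

# Faster but less safe: skip the aux coord equality check entirely, so
# lazy aux coords never have to be read from disk during concatenation.
result = cubes.concatenate(check_aux_coords=False)
```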
We are thinking of possible approaches to speed up the concatenation for similar use cases, i.e. long lists of cubes with lazy coordinates. One option would be to "realise" all coordinates, which could be done in parallel across all cubes, but this has the disadvantage of a considerably larger memory footprint.
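A minimal sketch of that first approach, assuming dask is the lazy backend (`realise_lazy_aux_coords` is a hypothetical helper, not an existing iris function):

```python
import dask


def realise_lazy_aux_coords(cubes):
    """Hypothetical helper: realise all lazy aux coord points in one go.

    Collecting every lazy points array and computing them in a single
    dask.compute call lets dask parallelise the disk reads; afterwards
    the realised points are cached on the coords, so the sequential
    equality checks in concatenate no longer touch the disk.
    """
    lazy_coords = [
        coord
        for cube in cubes
        for coord in cube.aux_coords
        if coord.has_lazy_points()
    ]
    realised = dask.compute(*(coord.core_points() for coord in lazy_coords))
    for coord, points in zip(lazy_coords, realised):
        coord.points = points  # this is where the memory cost appears
```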
Another potential strategy (suggested by my colleague @bouweandela) would be to load the coordinates, hash them, and store only the hashes, to be used later for comparisons between cubes. By running the coordinate loading and hashing for all cubes in parallel, one could get a considerable performance improvement without significantly increasing the memory footprint.
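A rough sketch of that idea (`coord_digest` and `coordinate_digests` are hypothetical names; SHA-256 over the raw bytes stands in for whichever hash function would actually be chosen):

```python
import hashlib

import dask
import numpy as np


@dask.delayed
def coord_digest(coord):
    # Realising the points happens inside the delayed task, so the disk
    # reads run in parallel; only the fixed-size digest is kept.
    points = np.ascontiguousarray(coord.points)
    return hashlib.sha256(points.tobytes()).hexdigest()


def coordinate_digests(cubes):
    """Hypothetical helper: hash every aux coord of every cube in parallel."""
    tasks = {
        (i, coord.name()): coord_digest(coord)
        for i, cube in enumerate(cubes)
        for coord in cube.aux_coords
    }
    digests = dask.compute(*tasks.values())
    return dict(zip(tasks.keys(), digests))
```

Two cubes' coordinates could then be compared by digest equality instead of element-wise array comparison.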
What are your thoughts on this? How would you see the (optional) usage of hashes for the coordinate comparison when concatenating cubes?
I may be out of my depth here, but is a [direct] hash of floating point values really useful? I imagine that some rounding, possibly user-adjustable, would avoid requiring unwarranted down-to-the-last-bit agreement.
Hi @larsbarring, good point indeed. But I believe the current implementation is also based on exact array equality comparisons (without tolerance)? As far as I understand, coordinate comparisons use iris.util.array_equal, which accounts for the potential presence of NaNs but does not include any tolerance factor.
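For reference, a quick illustration of that behaviour (using iris.util.array_equal and its withnans keyword):

```python
import numpy as np

from iris.util import array_equal

a = np.array([1.0, np.nan, 3.0])
b = a.copy()

print(array_equal(a, b))                 # False: NaN != NaN by default
print(array_equal(a, b, withnans=True))  # True: matching NaNs accepted

# ...but there is no tolerance: a one-ULP difference breaks equality.
b[0] = np.nextafter(b[0], 2.0)
print(array_equal(a, b, withnans=True))  # False
```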
Yes, I am afraid that is the case (without knowing exactly how it is done), and it has bitten us several times (e.g. here). Why enforce that level of precision when comparing floating point values?
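One possible direction, sketched under the assumption that hashes are adopted: round the values before hashing, so that agreement to a chosen number of decimals is enough (`rounded_digest` is a hypothetical helper; note that two values straddling a rounding boundary can still hash differently):

```python
import hashlib

import numpy as np


def rounded_digest(points, decimals=9):
    """Hypothetical tolerance-aware digest: values that agree to
    `decimals` decimal places produce identical hashes."""
    rounded = np.round(np.ascontiguousarray(points), decimals=decimals)
    return hashlib.sha256(rounded.tobytes()).hexdigest()
```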
Let's have a go at the hashing solution!
Aah -- this sounds promising :-)) pinging @ljoakim for info