Parallelising cube concatenation
In ESMValTool, we have some recipes that require concatenating a long list of cubes.
This is done with CubeList.concatenate, which, as far as I understand, loops over the cubes, identifies cubes with compatible signatures, and checks whether the coordinates of matching cubes are equal (in proto_cube.register):
https://github.com/SciTools/iris/blob/f8a45bed68982eb741c493aac49b99f56a208dad/lib/iris/_concatenate.py#L335-L353
As in #5743, we observe a significant slowdown when we try to act on a list of cubes with lazy coordinates. These coordinates need to be computed (read from disk) in order to compare them with those of the other cubes, and this is currently carried out sequentially. We could of course skip the auxiliary/derived coordinate comparison (e.g. check_aux_coords=False), but we would like to keep all checks to make sure the concatenation is robust.
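For reference, a minimal sketch of that escape hatch (`my_cubes` is a placeholder for the list of cubes to be concatenated):

```python
from iris.cube import CubeList

cubes = CubeList(my_cubes)  # my_cubes: placeholder list of iris Cubes

# Faster but less safe: skip the aux coord equality check entirely, so
# lazy aux coords never have to be read from disk during concatenation.
result = cubes.concatenate(check_aux_coords=False)
```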
We are thinking of possible approaches to speed up the concatenation for similar use cases, i.e. long lists of cubes with lazy coordinates. One option would be to "realise" all coordinates, which could be done in parallel across all cubes, but this has the disadvantage of a considerably larger memory footprint.
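A minimal sketch of that first approach, assuming dask is the lazy backend (`realise_lazy_aux_coords` is a hypothetical helper, not an existing iris function):

```python
import dask


def realise_lazy_aux_coords(cubes):
    """Hypothetical helper: realise all lazy aux coord points in one go.

    Collecting every lazy points array and computing them in a single
    dask.compute call lets dask parallelise the disk reads; afterwards
    the realised points are cached on the coords, so the sequential
    equality checks in concatenate no longer touch the disk.
    """
    lazy_coords = [
        coord
        for cube in cubes
        for coord in cube.aux_coords
        if coord.has_lazy_points()
    ]
    realised = dask.compute(*(coord.core_points() for coord in lazy_coords))
    for coord, points in zip(lazy_coords, realised):
        coord.points = points  # this is where the memory cost appears
```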
Another potential strategy (suggested by my colleague @bouweandela) would be to load the coordinates, hash them, and store only the hashes, to be used later for comparisons between cubes. By running the coordinate loading and hashing for all cubes in parallel, one could get a considerable performance improvement without significantly increasing the memory footprint.
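A rough sketch of that idea (`coord_digest` and `coordinate_digests` are hypothetical names; SHA-256 over the raw bytes stands in for whichever hash function would actually be chosen):

```python
import hashlib

import dask
import numpy as np


@dask.delayed
def coord_digest(coord):
    # Realising the points happens inside the delayed task, so the disk
    # reads run in parallel; only the fixed-size digest is kept.
    points = np.ascontiguousarray(coord.points)
    return hashlib.sha256(points.tobytes()).hexdigest()


def coordinate_digests(cubes):
    """Hypothetical helper: hash every aux coord of every cube in parallel."""
    tasks = {
        (i, coord.name()): coord_digest(coord)
        for i, cube in enumerate(cubes)
        for coord in cube.aux_coords
    }
    digests = dask.compute(*tasks.values())
    return dict(zip(tasks.keys(), digests))
```

Two cubes' coordinates could then be compared by digest equality instead of element-wise array comparison.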
What are your thoughts on this? How would you see the (optional) usage of hashes for the coordinate comparison when concatenating cubes?
I may be out of my depth here, but is a [direct] hash of floating point values really useful? I imagine that some rounding, possibly user-adjustable, would avoid requiring unwarranted down-to-the-last-bit agreement.
Hi @larsbarring, good point indeed. But I believe the current implementation is also based on exact array equality comparisons (without tolerance)? As far as I understand, coordinate comparisons use iris.util.array_equal, which accounts for the potential presence of NaNs but does not include any tolerance factor.
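For reference, a quick illustration of that behaviour (using iris.util.array_equal and its withnans keyword):

```python
import numpy as np

from iris.util import array_equal

a = np.array([1.0, np.nan, 3.0])
b = a.copy()

print(array_equal(a, b))                 # False: NaN != NaN by default
print(array_equal(a, b, withnans=True))  # True: matching NaNs accepted

# ...but there is no tolerance: a one-ULP difference breaks equality.
b[0] = np.nextafter(b[0], 2.0)
print(array_equal(a, b, withnans=True))  # False
```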
Yes, I am afraid that is the case (without knowing exactly how it is done), and it has bitten us several times (e.g. here). Why enforce that level of precision when comparing floating point values?
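One possible direction, sketched under the assumption that hashes are adopted: round the values before hashing, so that agreement to a chosen number of decimals is enough (`rounded_digest` is a hypothetical helper; note that two values straddling a rounding boundary can still hash differently):

```python
import hashlib

import numpy as np


def rounded_digest(points, decimals=9):
    """Hypothetical tolerance-aware digest: values that agree to
    `decimals` decimal places produce identical hashes."""
    rounded = np.round(np.ascontiguousarray(points), decimals=decimals)
    return hashlib.sha256(rounded.tobytes()).hexdigest()
```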
Let's have a go at the hashing solution!
Aah -- this sounds promising :-)) pinging @ljoakim for info