Rationalise CML Testing
`assert_CML()` and `assert_CML_approx_data()` are great conveniences for capturing Known Good Output - "we are happy with the way this `Cube` looks right now, and don't want it to change".
But there are some problems to address:
- Why do we have both functions? When is one more appropriate than the other? All developers should have clarity on this, I guess via docstrings. Is there any reason we can't just use `approx_data` for everything, which would reduce our complexity? (I believe the current mix is not by design, but mostly due to us copying patterns from existing tests)
- Can the routine be changed or replaced with something that isn't so vulnerable to changes in NumPy's array printing? We occasionally have to create large pull requests that simply 'reset' the Known Good Output to align with the latest behaviour - these are a waste of developer time and leave large marks on the commit history. Past examples:
  - #5235
  - #6035
Hashing arrays may be a solution, so long as the hashing is tolerant of minor changes.
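For illustration, one way to make a hash tolerant is to quantise the array before hashing - a minimal sketch, assuming rounding to a fixed number of decimals is acceptable (`tolerant_array_hash` is a hypothetical name, not existing Iris code):

```python
# Minimal sketch: quantise, then hash, so tiny floating-point drift maps
# to the same digest. Hypothetical helper - not existing Iris code.
import hashlib
import numpy as np

def tolerant_array_hash(array, decimals=6):
    """Hash an array after rounding, so near-identical arrays hash equally."""
    quantised = np.round(np.asarray(array, dtype=np.float64), decimals)
    # Fold in the shape so reshaped arrays do not collide.
    payload = quantised.tobytes() + repr(quantised.shape).encode()
    return hashlib.sha256(payload).hexdigest()
```

One caveat with any scheme like this: two values sitting either side of a rounding boundary still hash differently, so it is only ever approximately tolerant.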
From discussion with @HGWright, @ESadek-MO and @trexfeathers: the string representation of NumPy arrays is not expected to be stable - there is no 'contract' for this; so long as a human can read the string then it's fine. That is not a sensible basis for consistent testing. The `.npy` file format IS designed to be stable and backward compatible - the array is stored in its native binary format.
So: we could keep the CML format, but each CML file should be stored in a directory, together with `.npy` files for each NumPy array in the `Cube`. No arrays are stored in the CML file, only references to the relevant `.npy` files. We would need to write a basic CML parser for this to work.
As well as removing vulnerability to breaking changes in string format, storing NumPy arrays natively would allow use of `tests._shared_utils.assert_array_almost_equal()`, which offers tolerance options, and further standardises our testing.
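A hedged sketch of that layout - the directory name, file names and CML element are all hypothetical, and NumPy's own tolerance-aware comparator stands in for the `tests._shared_utils` helper:

```python
from pathlib import Path
import numpy as np

result_dir = Path("results/my_test.cml.d")  # hypothetical directory layout
result_dir.mkdir(parents=True, exist_ok=True)

points = np.linspace(0.0, 90.0, 10)
np.save(result_dir / "latitude_points.npy", points)  # stable binary .npy format

# The CML would hold only a reference, e.g. <points file="latitude_points.npy"/>.
# The test reloads the Known Good Output and compares with tolerance:
reloaded = np.load(result_dir / "latitude_points.npy")
np.testing.assert_array_almost_equal(points, reloaded)
```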
> So: we could keep the CML format, but each CML file should be stored in a directory, together with `.npy` files for each NumPy array in the `Cube`. No arrays are stored in the CML file, only references to the relevant `.npy` files. We would need to write a basic CML parser for this to work.
TBH I don't think this preserves the basic utility of what CML offers.
I think what we really want is a human-readable output that enables you to see what changed when something changed, but allows some floating-point equality tolerance.
So I'd prefer a programmed summary of floating-point content, more like what we now have in cube printouts. Unfortunately, I have to admit that this still relies on NumPy formatting, but to be fair the one-line summary approach has proved much more stable than the solutions we had previously.
For choice, I would record + compare the first + last 'N' points (? N=3 ?), and the sum; that would probably be good enough.
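A minimal sketch of that summary - first and last N points plus the sum - with all names illustrative rather than existing Iris API:

```python
import numpy as np

def float_summary(array, n=3):
    """One human-readable line: first/last n points, plus the sum."""
    flat = np.asarray(array, dtype=float).ravel()
    if flat.size <= 2 * n:
        body = ", ".join(f"{v:.6g}" for v in flat)  # small array: show everything
    else:
        head = ", ".join(f"{v:.6g}" for v in flat[:n])
        tail = ", ".join(f"{v:.6g}" for v in flat[-n:])
        body = f"{head}, ..., {tail}"
    return f"[{body}] sum={flat.sum():.6g}"
```

Because the formatting here is plain f-strings, the output only changes when we change it.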
> TBH I don't think this preserves the basic utility of what CML offers. I think what we ... I would record + compare the first + last 'N' points (? N=3 ?), and the sum
I realise I'm not sure if we're comparing really big arrays in this way, or is it "just" coordinates etc. Or maybe we just avoid anything too big, in which case all points might make good sense.
@pp-mo to provide tolerance while still keeping human readability, I think we would need to write our own parser, perhaps based on `numpy.fromstring()` but supporting N dimensions. To provide the stability we need, we would also need to write our own string representation, so that we are not vulnerable to changes in NumPy - again supporting N dimensions.
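For concreteness, a hedged sketch of what that bespoke round-trip might involve (illustrative names only; the text mode of `numpy.fromstring()` is deprecated in modern NumPy, so this sketch parses with a plain split instead):

```python
import numpy as np

def array_to_text(array):
    """Iris-controlled formatting: a shape header, then the flat values."""
    array = np.asarray(array, dtype=float)  # assumes at least 1-D
    shape = "x".join(str(d) for d in array.shape)
    return shape + ":" + ",".join(f"{v:.9e}" for v in array.ravel())

def text_to_array(text):
    """Matching parser - nothing here depends on NumPy's string repr."""
    shape_part, data_part = text.split(":", 1)
    shape = tuple(int(d) for d in shape_part.split("x"))
    return np.array([float(t) for t in data_part.split(",")]).reshape(shape)
```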
I'm sceptical about whether this would be worth the extra effort. In the graphics tests we have a well-established precedent for tests that need 5 minutes' extra work to see what has changed, and that workflow is easy enough to cope with. It would be similarly easy to work out changes in NumPy arrays by running the offending test locally.
I'd also argue that our current printing of arrays in CML files actually harms readability in some cases. Once an array gets beyond 2 dimensions it is tricky to understand the differences I am looking at. Maybe I need more practice.
Would have prevented a lot of the problems on #6518
Another thing I forgot to note was a previous conversation with @stephenworsley, who is also supportive of the test results being human-readable, so that's another vote:
- Against hashing arrays or saving `.npy` files
- For making the effort to develop our own string representation of arrays
> Another thing I forgot to note was a previous conversation with @stephenworsley, who is also supportive of the test results being human-readable, so that's another vote:
> - Against hashing arrays or saving `.npy` files
> - For making the effort to develop our own string representation of arrays
I was reminded that when we re-wrote the cube print/summary code, we did produce our own code to print a summary of real array data for human consumption, which I think has since proved itself pretty stable.
That is buried inside `_DimensionalMetadata.summary`, and its inner routine `array_summary`. It is written with some flexible control parameters.
However, I see it still uses `numpy.array2string` :-(
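Not Iris code, just an illustration of the kind of `numpy.array2string` call involved - readable, but the exact spelling of the output is NumPy's choice, which is the stability worry:

```python
import numpy as np

arr = np.linspace(0.0, 1.0, 1000)
# Collapses 1000 values to a short one-line view with "..." in the middle.
print(np.array2string(arr, precision=4, threshold=7, edgeitems=2))
```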
The multi-dimensional array parsing could perhaps be offloaded to the `json` parser?
That would just leave the formatting to be handled by something bespoke in Iris (i.e. an alternative to `np.array2string`).
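A sketch of how that split of responsibilities might look - a bespoke, json-compatible writer (the `np.array2string` replacement), with the standard-library `json` module handling the N-dimensional parsing on the way back in (names illustrative):

```python
import json
import numpy as np

def format_array(array):
    """Bespoke, json-compatible formatting - recursion handles N dimensions."""
    array = np.asarray(array, dtype=float)
    if array.ndim == 0:
        return f"{float(array):.9e}"
    return "[" + ", ".join(format_array(sub) for sub in array) + "]"

arr = np.arange(6.0).reshape(2, 3)
text = format_array(arr)                    # formatting: ours
round_tripped = np.array(json.loads(text))  # parsing: json's
assert np.array_equal(arr, round_tripped)
```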
> The multi-dimensional array parsing could perhaps be offloaded to the `json` parser? That would just leave the formatting to be handled by something bespoke in Iris (i.e. an alternative to `np.array2string`)
Sounds like a good idea. I suppose this would partly depend on how stable the output of that is; the key is whether NumPy/CPython care about the machine-readability (as opposed to string representations, where NumPy only cares about human-readability).
> TBH I don't think this preserves the basic utility of what CML offers. I think what we ... I would record + compare the first + last 'N' points (? N=3 ?), and the sum
>
> I realise I'm not sure if we're comparing really big arrays in this way, or is it "just" coordinates etc. Or maybe we just avoid anything too big, in which case all points might make good sense.
AFAIK we don't actually store the cube data in the CML in its entirety. We either store a single checksum hash to compare against, or an associated JSON file with some stats (mean, min, max, shape, etc.).
So, when we are talking about formatting of arrays, this mainly applies to coordinate data (and even then we store a potentially truncated representation with "N-edge" values).
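Illustrative only - a Python sketch of the kind of stats sidecar described above (the actual field names in Iris's JSON files may differ):

```python
import json
import numpy as np

data = np.arange(12.0).reshape(3, 4)
stats = {
    "shape": list(data.shape),
    "mean": float(data.mean()),
    "min": float(data.min()),
    "max": float(data.max()),
}
print(json.dumps(stats, indent=2))
```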
After discussion with the rest of AVD, the following is proposed for handling array output formatting in CML files:
- Output a checksum for all arrays (coords too, not just the data payload).
- Optional statistics (as per the `approx_data` keyword) output directly in the CML, rather than in a separate JSON file.
- Rather than using `numpy.array2string` to output arrays, generate a simple "summary view" of the array on a single line. This would be the first and last N values of the flattened array, formatted using simple Python f-strings.
- This summary, along with the checksum/stats, shape and dtype, should provide enough information on the consistency of the data without relying on any external formatting.
The option of also recording the number of masked points in the array was floated (although this would be captured in the checksum) - potentially a useful bit of information, though, to aid in debugging differences in CML.
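Pulling the proposal together, a hedged sketch of what one CML array entry's attributes might look like - `cml_array_attrs` and all attribute names are hypothetical, and the checksum and format choices are placeholders, not the agreed design:

```python
import zlib
import numpy as np

def cml_array_attrs(array, n=3, with_stats=False):
    """Attribute string for one CML array element (assumes more than 2*n points)."""
    flat = np.asarray(array).ravel()
    # The checksum flags any change at all in the raw bytes...
    checksum = f"0x{zlib.crc32(flat.tobytes()):08x}"
    # ...while the summary view stays human-readable via plain f-strings.
    head = ", ".join(f"{v:.6g}" for v in flat[:n])
    tail = ", ".join(f"{v:.6g}" for v in flat[-n:])
    attrs = (
        f'checksum="{checksum}" shape="{array.shape}" dtype="{array.dtype}" '
        f'summary="[{head}, ..., {tail}]" n_masked="{np.ma.count_masked(array)}"'
    )
    if with_stats:  # optional stats, as per the approx_data keyword
        attrs += f' mean="{flat.mean():.6g}" min="{flat.min():.6g}" max="{flat.max():.6g}"'
    return attrs
```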