Masked array performance

Open bjlittle opened this issue 4 years ago • 3 comments

Given https://github.com/SciTools/iris/pull/3301 and https://github.com/SciTools-incubator/iris-agg-regrid/pull/37 we should consider whether masked arrays with no masked points should be automatically converted to ordinary ndarrays for obvious performance benefits.

Note that python-netcdf4 now automatically always returns masked arrays.

This could be considered as a breaking change in iris behaviour, and so puts it plum in 3.0.0 territory.

As a user of iris, I want comparable regridding performance when using masked array data where no points are masked (i.e. mask=False) and non-masked arrays. I would also like to have the ability to apply this functionality to other areas of my code if I deem it necessary.

I would like the capacility to be avilable to users to apply else where in their code.

Acceptance Criteria

[ ] Create a function/decorator that can be applied to my iris function to give equivalent performance if my cube data is masked, but the mask is False.
[ ] The output cube data should remain masked if it was originally masked. Remain unmasked if originally unmasked.
[ ] The function/decorator is applied to appropriate iris regridding methods.
[ ] Have an ASV benchmark to prove the benefits of the solution.
[ ] The new functionality should be in the what's new.
[ ] The new functionality is documented in the user guide.

Oct 17 '19 14:10 bjlittle

masked array data where no points are masked (i.e. mask=False)

That "i.e." isn't quite an equation : You can still have a mask array, but with no True values. For operations where masks are costly, like regridding, it will still be worth the cost of a test + convert. But it probably needs to be on a case-by-case basis. We should have some common code for it, though.

Nov 11 '19 17:11 pp-mo

... meanwhile, could we not routinely convert the mask=False cases on loading, if that is generally useful ??

Nov 11 '19 17:11 pp-mo

This week I learned that a whole new implementation of masked arrays is afoot: https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md

Note that the owner of that repo is a numpy "member" and another member indicated a consensus for replacing numpy masked arrays "at some point" with something developed outside numpy: https://github.com/numpy/numpy/pull/16022#issuecomment-618616577

Oct 16 '20 08:10 rcomer

First steps of investigation

The netcdf4 package represents unmasked data using a masked array with a full array of masks - [False] with a shape matching the underlying data array. Some performance can be gained by setting the mask to a single False value, thanks to optimisations within NumPy itself for 'non-masked masked arrays' (thanks to @rcomer for this example).

There is however a larger amount of performance to be gained if we can operate on pure NumPy arrays. This may be because implementations in numpy.ma are sometimes piecemeal, so the non-masked optimisation might not be fully implemented. I unfortunately don't have the available resource to investigate within NumPy itself, so I will further pursue the idea of converting to NumPy arrays while operations are being performed.

How I investigated

SciTools/iris-agg-regrid#37 seemed like a good lead, so I used necromancy to get iris-agg-regrid working again and ran the below script.

Expand for the script

from time import time

import iris
from iris.cube import Cube
import numpy as np
from numpy.random import random
from numpy import ma

from agg_regrid import AreaWeighted


def main():
    global_air_temp = iris.load_cube(iris.sample_data_path('air_temp.pp'))
    rotated_psl = iris.load_cube(iris.sample_data_path('rotated_pole.nc'))

    def upsample_cube(cube_in: Cube):
        upsample_factor = 10

        new_dim_coords = []
        for coord_in in cube_in.dim_coords:
            points = coord_in.points
            new_points = np.linspace(points.min(), points.max(), len(points) * upsample_factor)
            new_dim_coords.append(coord_in.copy(points=new_points))

        new_shape = [len(c.points) for c in new_dim_coords]
        new_data = random(new_shape)

        return Cube(
            data=new_data,
            dim_coords_and_dims=[(c, ix) for ix, c in enumerate(new_dim_coords)]
        )

    global_air_temp = upsample_cube(global_air_temp)
    rotated_psl = upsample_cube(rotated_psl)

    for coord_name in ("longitude", "latitude"):
        src_coord = global_air_temp.coord(coord_name)
        tgt_coord = rotated_psl.coord(f"grid_{coord_name}")
        for coord in (src_coord, tgt_coord):
            coord.guess_bounds()

    area_weighted = AreaWeighted()

    def time_regridding(src_cube: Cube):
        regridder = area_weighted.regridder(src_cube, rotated_psl)
        time_start = time()
        _ = regridder(src_cube)
        time_elapsed = time() - time_start
        print(f"Elapsed seconds: {time_elapsed}")

    print("Basic NumPy ...")
    global_air_temp.data = np.asarray(global_air_temp.data)
    time_regridding(global_air_temp)

    print("Simple masked ...")
    global_air_temp.data = ma.asarray(global_air_temp.data)
    time_regridding(global_air_temp)

    print("Full masked ...")
    full_mask = np.broadcast_to(np.array(False), global_air_temp.data.shape)
    global_air_temp.data = ma.array(global_air_temp.data.data, mask=full_mask)
    time_regridding(global_air_temp)


if __name__ == "__main__":
    main()

Results:

Basic NumPy ...
Elapsed seconds: 1.890709638595581
Simple masked ...
Elapsed seconds: 1.89862060546875
Full masked ...
Elapsed seconds: 1.6773123741149902

Then I disabled the conversion introduced in SciTools/iris-agg-regrid#37, which produced the below results:

Basic NumPy ...
Elapsed seconds: 1.7257087230682373
Simple masked ...
Elapsed seconds: 4.838908910751343
Full masked ...
Elapsed seconds: 5.676675796508789

Feb 06 '23 16:02 trexfeathers

Possible approaches

Convert all inputs of a function, re-convert on output

Pros

Clear to the developer applying the decorator

Cons

Non-masked masked arrays can either be masked with an array of False, or with a single np.ma.nomask value. Difficult/impossible to know which masking strategy to apply when re-converting. Could skip conversion if presented with a mixture?

Array sub-classes

Pros

Obvious signal that they should be converted back
Can store the appropriate masking strategy to re-apply

Cons

How to re-convert? Ideally check every return value, but what if function includes object modification too?
More complex development pattern

A 'register' of objects to re-convert

Don't know how to implement this. A register of Cubes would work - can run my_cube.data = - but this concept is most useful if it can also work on straight arrays, and these are not contained by another object so assignment isn't possible.

Feb 08 '23 15:02 trexfeathers

I believe #5178 proves that, while there is a case for optimisation in specific cases, there isn't much benefit in a generic way of temporarily un-masking. Happy to open if others think differently.

Dec 20 '23 10:12 trexfeathers

iris iris copied to clipboard

Masked array performance

First steps of investigation

How I investigated

Possible approaches

Convert all inputs of a function, re-convert on output

Pros

Cons

Array sub-classes

Pros

Cons

A 'register' of objects to re-convert

iris
iris copied to clipboard