
[FEATURE]: Faster geodataframe to instance mask function

Open rbavery opened this issue 5 years ago • 4 comments

Is your feature request related to a problem? Please describe. I tried out the solaris instance_mask function and found that it took about 19 minutes to turn a geodataframe of ~10,000 labels into 917 512x512 tiles. I think the bulk of the overhead comes from calling rasterio.features.rasterize for every single instance:

https://github.com/CosmiQ/solaris/blob/ef86c100600fe5f066b91a345ea656592e142b3b/solaris/vector/mask.py#L945

Describe the solution you'd like Ideally this would take less than 5 minutes.

Before finding solaris I was working with a rougher implementation of this function. While solaris goes from geodataframe > multiple geojson tiles > mask tiles, my implementation goes from geodataframe > single large mask > tiles > scikit-image connected components > mask tiles. This results in one rasterize call per tile rather than per instance label. The downside is that connected components isn't exactly the same as rasterio's rasterize: if two instances touch by even a single pixel, they are treated as one instance. The upside is that it ran in about 2 minutes on the same dataset.
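For illustration, a minimal sketch of that approximate pipeline might look like the following (this is not the solaris implementation; gdf, transform, height, width, and tile_size are placeholder names, and the tiling here is just a naive grid over the full mask):

    from rasterio import features
    from skimage.measure import label

    def approximate_instance_tiles(gdf, transform, height, width, tile_size=512):
        # Burn every geometry into one binary mask with a single rasterize call.
        binary_mask = features.rasterize(
            ((geom, 1) for geom in gdf.geometry),
            out_shape=(height, width),
            transform=transform,
            fill=0,
            dtype='uint8')
        # Cut the mask into tiles, then label connected regions within each tile.
        tiles = []
        for row in range(0, height, tile_size):
            for col in range(0, width, tile_size):
                window = binary_mask[row:row + tile_size, col:col + tile_size]
                # Instances that touch are merged into one label here (the
                # trade-off described above).
                tiles.append(label(window))
        return tiles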

Describe alternatives you've considered I was thinking that each call to rasterize could be wrapped by dask.delayed, since the operation is embarrassingly parallel: https://examples.dask.org/delayed.html

There could also be a function that uses the scikit-image connected components method described above; it could be named something like instance_mask_approximate. Depending on the nature of the labels this might not be very useful, though. Mine just happen to be mostly well separated with few touching pixels, and where pixels do touch I can deal with it by applying morphological erosion and dilation.

I'm curious whether there are any other ideas on how to improve the speed of this function. It'd be great to have the accuracy of the rasterio.features.rasterize method at a speed closer to connected components.

rbavery avatar Dec 16 '19 21:12 rbavery

Hey @rbavery,

It's definitely really slow at the moment. I think part of it is the size of the arrays that get generated: for example, if we're looking at a SpaceNet 4 tile with 100 buildings, that'll be an array of shape (900, 900, 100), which is an enormous object to hold in memory. I don't think it's something that could be resolved by making one massive mask at first and then splitting it, since that mask would be far larger along each axis and quickly become unmanageable.

At present, I'm a little hesitant to begin using dask.delayed for this. I'm not super familiar with dask, but since the loop is by definition sequential (i.e. the idx has to be incremented to accurately place slices), I don't know if delayed can automate that (if you know it can, please do correct me). If we wanted to parallelize using core Python's multiprocessing module (a pool of workers with map(), starmap(), or the like), we could potentially do this and pass [idx, feat] to each worker. This certainly doesn't help with memory, though, bringing me to my preferred solution:

Reduce memory burden by never having more than one instance slice loaded

To reduce memory burden when the mask is being saved to file, the loop that you referenced above could be extended to include writing each channel to file iteratively. In that case, we wouldn't generate output_arr anymore, and would instead only have one channel at a time. Something like:

    if out_file:
        meta = reference_im.meta.copy()
        # output_arr no longer exists in this version, so take the band count
        # from the number of features instead.
        meta.update(count=len(feature_list))
        if out_type == 'int':
            meta.update(dtype='uint8')
        with rasterio.open(out_file, 'w', **meta) as dst:
            for idx, feat in enumerate(feature_list):
                # Rasterize one feature at a time and write it straight to its
                # own band, so only a single instance slice is held in memory.
                dst.write(features.rasterize([feat], out_shape=shape,
                                             transform=affine_obj),
                          indexes=idx+1)

In the case where the array is being returned directly and nothing is being written, it would have to stick with the slower implementation that currently exists, or the parallelized approach described above.
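For what it's worth, a rough sketch of the multiprocessing variant mentioned above could look like this (feature_list, shape, and affine_obj are assumed to be the same objects used in the existing loop; this is only an illustration and hasn't been tested against solaris):

    from multiprocessing import Pool

    import numpy as np
    from rasterio import features

    def _rasterize_one(feat, shape, affine_obj):
        # Each worker rasterizes a single feature into its own (H, W) slice.
        return features.rasterize([feat], out_shape=shape, transform=affine_obj)

    def rasterize_parallel(feature_list, shape, affine_obj, processes=4):
        with Pool(processes=processes) as pool:
            # starmap preserves input order, so band order matches feature order.
            slices = pool.starmap(
                _rasterize_one,
                [(feat, shape, affine_obj) for feat in feature_list])
        return np.dstack(slices)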

Thoughts?

nrweir avatar Dec 17 '19 15:12 nrweir

Hey Nick, to be clearer about the connected components method: I was first rasterizing all shapes into a single binary raster with features.rasterize. Connected components can take a (900, 900) shaped binary raster and turn it into a (900, 900) shaped raster where each connected region is assigned a unique label, as in the screenshot attached to the original issue.

But sometimes labels are too close together and get rasterized as a single connected component. Using morphological erosion to deal with this before connected components is an imperfect solution.
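As a rough illustration of that workaround (assuming a recent scikit-image that provides expand_labels; the erosion radius is a guess, and this only approximately recovers the original footprints):

    import numpy as np
    from skimage.measure import label
    from skimage.morphology import binary_erosion, disk
    from skimage.segmentation import expand_labels

    def label_with_erosion(binary_mask, radius=1):
        # Erode so instances that touch by only a pixel or two come apart.
        eroded = binary_erosion(binary_mask, disk(radius))
        labels = label(eroded)
        # Grow the labels back out, but keep them inside the original footprint.
        grown = expand_labels(labels, distance=radius + 1)
        return np.where(binary_mask, grown, 0)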

I also found that the function isn't as slow as I thought. There's an edge case where, if the aoi_boundary used to define the tile bounds is larger than the image being tiled, empty tiles filled with nodata are created wherever the image does not intersect the aoi_boundary. I'm submitting a PR to fix this soon.

In my tests, where I'm storing (at the largest) (512, 512, 200) label tiles, I don't see memory usage go beyond 2GB, so I don't know that memory is too much of an issue (at least in my limited case). This might be different for larger tile sizes and larger numbers of instances.

Possible dask solution

I think dask.delayed can still work here, abstracting away multiprocessing. The delayed result of each call to features.rasterize could be stored in a list that preserves the order of the for loop, computed, and then concatenated into an array so the order of the results is preserved, like so:

import dask
import numpy as np
from rasterio import features

# Build a list of lazy rasterize calls; list order matches feature order.
output_list = []
for feat in feature_list:
    delayed_result = dask.delayed(features.rasterize)(
        [feat], out_shape=shape, transform=affine_obj)
    output_list.append(delayed_result)
# Compute all the slices (potentially in parallel), then stack them.
output_list = dask.compute(*output_list)
output_arr = np.dstack(output_list)

At the end of the day I think this would use about as much memory as the current implementation.

rbavery avatar Dec 20 '19 04:12 rbavery

OK sounds good, happy to try this approach out then! Thanks for the suggestion. If we begin using dask here, I might try to re-write some of the CLI at some point to use that instead of the underlying multiprocessing functionality, since multiprocessing code is generally harder to read...

nrweir avatar Dec 20 '19 15:12 nrweir

No prob! I'll get to this once the holidays wind down. I think that'd be a useful rewrite; I've found dask to be very intuitive, and it comes with a nifty dashboard for visualizing workloads.

rbavery avatar Dec 20 '19 20:12 rbavery