
Use dask to speed up SAM algorithm in mineral.py

Open aheermann opened this issue 4 years ago • 8 comments

The SAM algorithm in mineral.py takes most of the runtime when classifying an image, specifically in these loops:

    # for each pixel in the image
    for x in range(M):
        for y in range(N):

            # read the pixel from the file
            pixel = data[x,y]

            # if it is not a no data pixel
            if not numpy.isclose(pixel[0], -0.005) and not pixel[0]==-50:

                # resample the pixel ignoring NaNs from target bands that don't overlap
                # TODO fix spectral library so that bands are in order
                resampled_pixel = numpy.nan_to_num(resample(pixel))

                # calculate spectral angles
                angles = spectral.spectral_angles(resampled_pixel[numpy.newaxis,
                                                                 numpy.newaxis,
                                                                 ...],
                                                  library.spectra)

                # normalize confidence values from [pi,0] to [0,1]
                angles = 1 - angles / math.pi

                # get index of class with largest confidence value
                index_of_max = numpy.argmax(angles)

                # get confidence value of the classified pixel
                score = angles[0,0,index_of_max]

                # classify pixel if confidence above threshold
                if score > threshold:

                    # index from one (after zero for no data)
                    classified[x,y] = index_of_max + 1

                    if scores_file_name is not None:
                        # store score value
                        scored[x,y] = score

Parallelizing this method should reduce runtimes. I think the Dask library would be a good place to start. https://github.com/dask/dask https://dask.org/
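Before the work is spread across workers, it may help to note that the per-pixel body is plain array arithmetic, so a whole block of pixels can be classified in one call. The sketch below is a NumPy-only illustration of that idea, not pycoal's code: `classify_block` is a hypothetical name, the pixels are assumed to be already resampled to the library's bands, and all-zero pixels are treated as unclassified (an assumption, since the loop above does not special-case them). A function of this shape would also be a natural unit of work to hand to Dask.

```python
import numpy as np


def classify_block(block, library_spectra, threshold=0.0):
    """Classify a (rows, cols, bands) block of pixels in one pass.

    Hypothetical helper, not part of pycoal. Assumes the pixels are
    already resampled to the bands of library_spectra (classes, bands).
    """
    rows, cols, bands = block.shape
    raw = block.reshape(-1, bands).astype(float)
    pixels = np.nan_to_num(raw)

    # same no-data test as the original loop, applied to the first band
    valid = ~np.isclose(raw[:, 0], -0.005) & (raw[:, 0] != -50)

    # assumption: all-zero pixels have no defined angle, so skip them
    pixel_norms = np.linalg.norm(pixels, axis=1)
    valid &= pixel_norms > 0

    # cosine of the angle between every pixel and every library spectrum
    library_norms = np.linalg.norm(library_spectra, axis=1)
    denom = pixel_norms[:, None] * library_norms[None, :]
    denom = np.where(denom == 0, 1.0, denom)
    cosines = np.clip(pixels @ library_spectra.T / denom, -1.0, 1.0)

    # map angles in [0, pi] to confidence values in [1, 0]
    confidence = 1.0 - np.arccos(cosines) / np.pi
    best = np.argmax(confidence, axis=1)
    score = confidence[np.arange(pixels.shape[0]), best]

    # class indices start at one; zero is kept for unclassified / no data
    keep = valid & (score > threshold)
    classified = np.where(keep, best + 1, 0).reshape(rows, cols)
    scored = np.where(keep, score, 0.0).reshape(rows, cols)
    return classified, scored
```

Because the result depends only on the pixels in the block, the image can be cut into blocks of any size and the outputs stitched back together.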

aheermann avatar Sep 18 '19 20:09 aheermann

Hi @aheermann can you put this on the agenda for the next meeting? I am really keen to see what your plan for this is. Also, it might be appropriate for us to split this into smaller tasks... this may end up a pretty large undertaking.

lewismc avatar Sep 18 '19 22:09 lewismc

Yep, I'll put it on the agenda. As for the size of the undertaking, our idea was to do some preliminary investigation and trials with this module to see if it could work. We also have Jonathan and Dennis investigating PyTorch for parallelizing the same code, so that we can move forward with the most appropriate module.

aheermann avatar Sep 18 '19 22:09 aheermann

Excellent

lewismc avatar Sep 19 '19 00:09 lewismc

Early branch available at https://github.com/capstone-coal/pycoal/tree/dask_trial

lewismc avatar Sep 27 '19 19:09 lewismc

Thus far, we have been working on the SAM algorithm, trying to speed up pixel classification. We have tried several ways of splitting the pixel processing into dask delayed functions in order to parallelize it. However, on the smaller data set we are using, the overhead has so far outweighed any speedup. We are running on the f180201t01p00r05rdn_e_sc01_ort_img.hdr image, which takes about 3 hours 25 minutes un-parallelized on my machine, using the original master branch as a baseline.
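One possible source of that overhead is task granularity: if each delayed call covers a single pixel, the scheduler handles one tiny task per pixel and the bookkeeping can cost more than the arithmetic. The sketch below batches whole slabs of rows into each task instead. It is an illustration under assumptions, not the code on the branch: `classify_rows` and `classify_in_chunks` are hypothetical names, `rows_per_task=64` is an arbitrary starting point, no-data handling is left out, and the import guard only exists so the sketch still runs serially where dask is not installed.

```python
import numpy as np

try:
    import dask
except ImportError:  # assumption: fall back to a serial loop without dask
    dask = None


def classify_rows(rows, library_spectra):
    """Stand-in for the SAM work on a (n, cols, bands) slab of rows.

    Hypothetical helper; no-data and threshold handling are omitted.
    """
    n, cols, bands = rows.shape
    pixels = rows.reshape(-1, bands)
    pixel_norms = np.linalg.norm(pixels, axis=1, keepdims=True)
    library_norms = np.linalg.norm(library_spectra, axis=1)
    cosines = pixels @ library_spectra.T / (pixel_norms * library_norms)
    # the largest cosine is the smallest spectral angle
    return (np.argmax(cosines, axis=1) + 1).reshape(n, cols)


def classify_in_chunks(data, library_spectra, rows_per_task=64):
    """Create one task per slab of rows rather than one per pixel."""
    slabs = [data[i:i + rows_per_task]
             for i in range(0, data.shape[0], rows_per_task)]
    if dask is None:
        parts = [classify_rows(slab, library_spectra) for slab in slabs]
    else:
        tasks = [dask.delayed(classify_rows)(slab, library_spectra)
                 for slab in slabs]
        # compute() returns the results in the same order as the tasks
        parts = dask.compute(*tasks, scheduler="threads")
    return np.concatenate(parts, axis=0)
```

Timing a few values of `rows_per_task` against the serial baseline would show whether the per-task overhead is what is hiding the speedup.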

aheermann avatar Oct 01 '19 01:10 aheermann

@aheermann can you please hyperlink the dataset? A few more questions:

which has a baseline un-parallelized runtime

Do you mean the pycoal master branch? If not, then this is not much to worry about, as it is to be expected. Please provide more details. Thanks

lewismc avatar Oct 01 '19 13:10 lewismc

Since the last update, dask was temporarily put on hold because our personal machines were not powerful enough to take advantage of it. Now that we have access to AWS, we will resume work on dask. It will be one of several options, alongside PyTorch (#172) and Joblib (#177), for users running Pycoal.

aheermann avatar Oct 17 '19 22:10 aheermann

@aheermann got it. Thinking about the abstraction layer here is an important part of engineering a good solution. Please start thinking about that. It will require you to work with others in the group.
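As a starting point for that discussion, one possible shape for the layer is a small registry in which every backend exposes the same call: take a work function and a list of chunks, return the results in order. The sketch below uses only the standard library so that it runs anywhere; all names in it (`BACKENDS`, `register_backend`, `run`) are hypothetical, and Dask, Joblib, or PyTorch runners would be registered the same way by whoever owns each backend.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry: backend name -> function(work, chunks) -> list
BACKENDS = {}


def register_backend(name):
    """Decorator that adds a runner to the registry under `name`."""
    def decorator(func):
        BACKENDS[name] = func
        return func
    return decorator


@register_backend("serial")
def run_serial(work, chunks):
    # baseline: process the chunks one after another
    return [work(chunk) for chunk in chunks]


@register_backend("threads")
def run_threads(work, chunks, max_workers=4):
    # Executor.map yields results in the order of the inputs
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, chunks))


def run(work, chunks, backend="serial"):
    """Apply `work` to every chunk using the named backend."""
    if backend not in BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; choose from {sorted(BACKENDS)}"
        )
    return BACKENDS[backend](work, chunks)
```

With this shape the classification code would call `run(...)` once and never import a parallel library directly, so backends can be compared on the same image by changing a single argument.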

lewismc avatar Oct 17 '19 23:10 lewismc