pycoal
Use dask to speed up SAM algorithm in mineral.py
The SAM algorithm in mineral.py takes the majority of the time when classifying an image, specifically in these loops:
```python
# for each pixel in the image
for x in range(M):
    for y in range(N):
        # read the pixel from the file
        pixel = data[x, y]
        # if it is not a no-data pixel
        if not numpy.isclose(pixel[0], -0.005) and not pixel[0] == -50:
            # resample the pixel, ignoring NaNs from target bands that don't overlap
            # TODO fix spectral library so that bands are in order
            resampled_pixel = numpy.nan_to_num(resample(pixel))
            # calculate spectral angles
            angles = spectral.spectral_angles(resampled_pixel[numpy.newaxis,
                                                              numpy.newaxis,
                                                              ...],
                                              library.spectra)
            # normalize confidence values from [pi,0] to [0,1]
            for z in range(angles.shape[2]):
                angles[0, 0, z] = 1 - angles[0, 0, z] / math.pi
            # get index of class with largest confidence value
            index_of_max = numpy.argmax(angles)
            # get confidence value of the classified pixel
            score = angles[0, 0, index_of_max]
            # classify pixel if confidence above threshold
            if score > threshold:
                # index from one (after zero for no data)
                classified[x, y] = index_of_max + 1
                if scores_file_name is not None:
                    # store score value
                    scored[x, y] = score
```
Parallelizing this method should significantly reduce runtimes. I think trying the Dask module would be a good starting point: https://github.com/dask/dask https://dask.org/
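As a rough illustration of the idea, the per-pixel loop could be split into row chunks and dispatched with `dask.delayed`. This is only a hypothetical sketch: `classify_chunk` is a stand-in for the per-pixel SAM work above, and the chunk count is illustrative, not pycoal code.

```python
import numpy
from dask import delayed, compute

def classify_chunk(chunk):
    # placeholder for the per-pixel SAM classification shown above;
    # here we just threshold on the first band for illustration
    return (chunk[..., 0] > 0.5).astype(numpy.int64)

data = numpy.random.rand(100, 80, 5)         # toy image: 100x80 pixels, 5 bands
chunks = numpy.array_split(data, 4, axis=0)  # split rows into 4 chunks

# build one delayed task per chunk, then run all tasks in parallel
tasks = [delayed(classify_chunk)(c) for c in chunks]
results = compute(*tasks)
classified = numpy.concatenate(results, axis=0)
```

Because each task processes a whole block of rows rather than a single pixel, the scheduling overhead stays small relative to the work per task.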
Hi @aheermann can you put this on the agenda for the next meeting? I am really keen to see what your plan for this is. Also, it might be appropriate for us to split this into smaller tasks... this may end up a pretty large undertaking.
Yep, I'll put it on the agenda. As for the undertaking, our idea was to do some preliminary investigation and trials with this module to see if it could work. We also have Jonathan and Dennis investigating PyTorch for parallelization of the same code, so that we can move forward with the most appropriate module.
Excellent
Early branch available at https://github.com/capstone-coal/pycoal/tree/dask_trial
Thus far, we have been working on the SAM algorithm, trying to speed up pixel classification. We have tried several ways of splitting the pixel processing into dask delayed methods in order to parallelize it. However, on the smaller data set we are using, the overhead has not led to any speed-ups yet. We are running on the f180201t01p00r05rdn_e_sc01_ort_img.hdr image, which, using the original master branch as a baseline, takes about 3 hours 25 minutes un-parallelized on my machine.
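One likely reason the overhead dominated is task granularity: a delayed call per pixel creates millions of tiny tasks. A hypothetical sketch (not pycoal code) of keeping chunks coarse with `dask.array.map_blocks` instead, where `classify_block` stands in for the SAM computation on a whole block of pixels:

```python
import numpy
import dask.array as da

def classify_block(block):
    # stand-in for the SAM computation over an entire block of pixels
    return (block[..., 0] > 0.5).astype(numpy.int64)

data = numpy.random.rand(100, 80, 5)  # toy image: 100x80 pixels, 5 bands
# one chunk per 25 rows keeps the task count small relative to the work per task
darr = da.from_array(data, chunks=(25, 80, 5))
# drop_axis=2: the band axis is consumed, so the output is a 2-D class map
classified = darr.map_blocks(classify_block, drop_axis=2,
                             dtype=numpy.int64).compute()
```

Tuning the chunk size up or down trades scheduling overhead against parallelism, which may matter more on small images like the test scene.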
@aheermann can you please hyperlink the dataset. A few more questions:

> which has a baseline un-parallelized runtime

Do you mean the pycoal master branch? If not, then this is not much to worry about, as it is to be expected. Please provide more details. Thanks
Since the last update, dask was temporarily put on hold because our personal machines were not powerful enough to take advantage of it. Now that we have access to AWS, we will pick work on dask back up. It will be one of several options, including PyTorch (#172) and Joblib (#177), for users running pycoal.
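Since Joblib (#177) is mentioned as another backend option, here is a minimal hypothetical sketch of the same row-chunk split using `joblib.Parallel`; `classify_rows` and the chunk/worker counts are illustrative only:

```python
import numpy
from joblib import Parallel, delayed

def classify_rows(rows):
    # stand-in for the per-pixel SAM classification on a block of rows
    return (rows[..., 0] > 0.5).astype(numpy.int64)

data = numpy.random.rand(100, 80, 5)         # toy image: 100x80 pixels, 5 bands
chunks = numpy.array_split(data, 4, axis=0)  # split rows into 4 chunks

# dispatch one job per chunk across 2 worker processes
results = Parallel(n_jobs=2)(delayed(classify_rows)(c) for c in chunks)
classified = numpy.concatenate(results, axis=0)
```

The structure is close enough to the dask version that both could sit behind the same chunk-and-classify interface.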
@aheermann got it. Thinking about the abstraction layer here is an important part of engineering a good solution. Please start thinking about that. It will require you to work with others in the group.
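To make the abstraction-layer idea concrete, one possible shape is a backend registry that dispatches the same classification work to whichever parallelization module the user selects. All names below are illustrative assumptions, not existing pycoal API:

```python
import numpy

def _classify_serial(data):
    # stand-in for the un-parallelized SAM loop
    return (data[..., 0] > 0.5).astype(numpy.int64)

# hypothetical registry; "dask", "joblib", and "pytorch" entries would
# wrap the same per-chunk function with their respective schedulers
BACKENDS = {"serial": _classify_serial}

def classify(data, backend="serial"):
    """Dispatch classification to the chosen parallelization backend."""
    if backend not in BACKENDS:
        raise ValueError("unknown backend: %s" % backend)
    return BACKENDS[backend](data)

result = classify(numpy.random.rand(10, 8, 5))
```

Keeping the SAM math in one shared function and isolating the parallelism behind a registry like this would let the dask, PyTorch, and Joblib efforts proceed independently.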