climate_indices Replace loops with numpy broadcasting for optimization

Replace loops with numpy broadcasting for optimization

Open monocongo opened this issue 6 years ago • 1 comments

In many places there are loops that can/should be replaced by broadcasting and/or other tricks available via numpy/pandas capabilities. The lowest hanging fruit is probably the percent of normal precipitation index.

May 17 '18 01:05 monocongo

Initial testing has given mixed results, depending upon whether or not Numba is enabled or not. With Numba disabled we get about fifteen times speedup using broadcasting (see new code below), but with Numba in play the broadcasting roughly doubles the execution time of the code with loops. To wit:

$ export NUMBA_DISABLE_JIT=1 $ python time_test.py Previous PNP time: 1.6137288540073078 New PNP time: 0.107064626452404 $ unset NUMBA_DISABLE_JIT $ python time_test.py Previous PNP time: 0.06601208387594348 New PNP time: 0.1309336793395115

(This may have to do with my new code not being as optimal as it could be -- I'm not a numpy wizard.)

New code for PNP, with broadcasting:

#-------------------------------------------------------------------------------------------------------------------------------------------
@numba.jit     
def percentage_of_normal(values, 
                         scale,
                         data_start_year,
                         calibration_start_year,
                         calibration_end_year,
                         time_series_type):
    '''
    This function finds the percent of normal values (average of each calendar month or day over a specified 
    calibration period of years) for a specified time steps scale. The normal precipitation for each calendar time step 
    is computed for the specified time steps scale, and then each time step's scaled value is compared against the 
    corresponding calendar time step's average to determine the percentage of normal. The period that defines the 
    normal is described by the calibration start and end years arguments. The calibration period typically used  
    for US climate monitoring is 1981-2010. 
    
    :param values: 1-D numpy array of precipitation values, any length, initial value assumed to be January of the data 
                   start year (January 1st of the start year if daily time series type), see the description of the 
                   *time_series_type* argument below for further clarification
    :param scale: integer number of months over which the normal value is computed (eg 3-months, 6-months, etc.)
    :param data_start_year: the initial year of the input monthly values array
    :param calibration_start_year: the initial year of the calibration period over which the normal average for each  
                                   calendar time step is computed 
    :param calibration_start_year: the final year of the calibration period over which the normal average for each 
                                   calendar time step is computed 
    :param time_series_type: the type of time series represented by the input data, valid values are 'monthly' or 'daily'
                             'monthly': array of monthly values, assumed to span full years, i.e. the first value 
                             corresponds to January of the initial year and any missing final months of the final year 
                             filled with NaN values, with size == # of years * 12
                             'daily': array of full years of daily values with 366 days per year, as if each year were 
                             a leap year and any missing final months of the final year filled with NaN values, 
                             with array size == (# years * 366)
    :return: percent of normal precipitation values corresponding to the scaled precipitation values array   
    :rtype: numpy.ndarray of type float
    '''

    # bypass processing if all values are masked    
    if np.ma.is_masked(values) and values.mask.all():
        return values
    
    # if doing monthly then we'll use 12 periods, corresponding to calendar months, if daily assume years w/366 days
    if time_series_type == 'monthly':
        periodicity = 12
    elif time_series_type == 'daily':
        periodicity = 366
    else:
        message = 'Invalid time series type argument: \'{0}\''.format(time_series_type)
        _logger.error(message)
        raise ValueError(message)
    
    # make sure we've been provided with sane calibration limits
    if data_start_year > calibration_start_year:
        raise ValueError('Invalid start year arguments (data and/or calibration): calibration start year ' + \
                         'is before the data start year')
    elif ((calibration_end_year - calibration_start_year + 1) * 12) > values.size:
        raise ValueError('Invalid calibration period specified: total calibration years exceeds the actual ' + \
                         'number of years of data')
        
    # get an array containing a sliding sum on the specified time step scale -- i.e. if the scale is 3 then the first 
    # two elements will be np.NaN, since we need 3 elements to get a sum, and then from the third element to the end 
    # the values will equal the sum of the corresponding time step plus the values of the two previous time steps
    scale_sums = compute.sum_to_scale(values, scale)
    
    # reshape into a 2-D array with the first axis representing years, 
    # i.e. (years, 12) for monthly, or (years, 366) for daily  
    scale_sums = utils.reshape_to_2d(scale_sums, periodicity)
    
    # extract the time steps over which we'll compute the normal average for each time step of the year
    calibration_years = calibration_end_year - calibration_start_year + 1
    calibration_start_index = calibration_start_year - data_start_year
    calibration_end_index = calibration_start_index + calibration_years
    calibration_period_sums = scale_sums[calibration_start_index:calibration_end_index]
    
    # for each time step in the calibration period, get the average of the scaled sums 
    # for that time step (i.e. average all January sums, then all February sums, etc.) 
    averages = np.nanmean(calibration_period_sums, axis=0)
    
    # reshape the averages from 1-D to 2-D so it's in proper shape for the broadcasting we'll do below
    averages = np.reshape(averages, (1, averages.size))
    
    # divide each value by it's corresponding time step average
    percentages_of_normal = scale_sums / averages
        
    return percentages_of_normal.flatten()

Timing code:

import timeit
 
#--------------------------------------------------------------------------------- 
# compute PNP time using previous version with loops
def previous_pnp():
    SETUP_CODE = '''
import climate_indices.indices as indices
import tests.fixtures as fixtures
precips = fixtures.FixturesTestCase.fixture_precips_mm_monthly.flatten()
data_year_start = fixtures.FixturesTestCase.fixture_data_year_start_monthly
calibration_year_start = fixtures.FixturesTestCase.fixture_calibration_year_start_monthly
calibration_year_end = fixtures.FixturesTestCase.fixture_calibration_year_end_monthly
'''
 
    TEST_CODE = '''
indices.previous_percentage_of_normal(precips, 
                                      6, 
                                      data_year_start, 
                                      calibration_year_start, 
                                      calibration_year_end, 
                                      'monthly')
'''
     
    # timeit.repeat statement
    times = timeit.repeat(setup = SETUP_CODE,
                          stmt = TEST_CODE,
                          repeat = 3,
                          number = 1000)
 
    # print the minimum execution time
    print('Previous PNP time: {}'.format(min(times)))        
 
#--------------------------------------------------------------------------------- 
# compute PNP time using new version with numpy broadcasting
def new_pnp():
    SETUP_CODE = '''
import climate_indices.indices as indices
import tests.fixtures as fixtures
precips = fixtures.FixturesTestCase.fixture_precips_mm_monthly.flatten()
data_year_start = fixtures.FixturesTestCase.fixture_data_year_start_monthly
calibration_year_start = fixtures.FixturesTestCase.fixture_calibration_year_start_monthly
calibration_year_end = fixtures.FixturesTestCase.fixture_calibration_year_end_monthly
'''
 
    TEST_CODE = '''
indices.percentage_of_normal(precips, 
                             6, 
                             data_year_start, 
                             calibration_year_start, 
                             calibration_year_end, 
                             'monthly')
'''
     
    # timeit.repeat statement
    times = timeit.repeat(setup = SETUP_CODE,
                          stmt = TEST_CODE,
                          repeat = 3,
                          number = 1000)
 
    # print the minimum execution time
    print('New PNP time: {}'.format(min(times)))        

#--------------------------------------------------------------------------------- 
if __name__ == "__main__":
    previous_pnp()
    new_pnp()

May 17 '18 02:05 monocongo

climate_indices climate_indices copied to clipboard

Replace loops with numpy broadcasting for optimization

climate_indices
climate_indices copied to clipboard