Speed up parallel from_group_dataframe

Open tRosenflanz opened this issue 11 months ago • 1 comments

Is your feature request related to a current problem? Please describe. Parallel timeseries.from_group_dataframe currently passes sub_df around per group, which can be slow when there are a lot of groups to process due to parallelization overhead.

Describe proposed solution Instead of processing each group individually, split the initial dataframe into n_jobs chunks and process each of those chunks sequentially (i.e. with n_jobs=1). This way each worker receives many groups at once, so the dispatch overhead is paid once per chunk rather than once per group.

Describe potential alternatives A mix of the two approaches can work as well
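One possible reading of "a mix of the two approaches" is to create more batches than workers, so that batches of uneven cost can still balance across workers while each task carries many groups. The helper below is only a sketch of that idea: `make_batches` and `batches_per_job` are hypothetical names, not part of Darts or joblib.

```python
from typing import Hashable, List, Sequence


def make_batches(
    group_keys: Sequence[Hashable],
    n_jobs: int,
    batches_per_job: int = 4,  # hypothetical tuning knob, not a Darts parameter
) -> List[List[Hashable]]:
    """Split group keys into at most n_jobs * batches_per_job near-equal batches."""
    if n_jobs < 1 or batches_per_job < 1:
        raise ValueError("n_jobs and batches_per_job must be positive")
    # never create more batches than there are groups
    n_batches = min(len(group_keys), n_jobs * batches_per_job)
    if n_batches == 0:
        return []
    base, extra = divmod(len(group_keys), n_batches)
    batches: List[List[Hashable]] = []
    start = 0
    for i in range(n_batches):
        # the first `extra` batches take one additional key
        stop = start + base + (1 if i < extra else 0)
        batches.append(list(group_keys[start:stop]))
        start = stop
    return batches
```

Each batch of keys would then be turned into a sub-dataframe and handed to from_group_dataframe with n_jobs=1, as in the stub in this issue. With batches_per_job=1 this reduces to the proposed one-chunk-per-worker split.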

Additional context Stub of logic to compare the results:

Baseline call:

ts.TimeSeries.from_group_dataframe(
    data_df,
    group_cols=grouper,
    value_cols=val,
    time_col="date",
    n_jobs=-1,
)

A (potentially) significantly faster implementation to compare:

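import numpy as np
from joblib import Parallel, cpu_count, delayed

def process_chunk(chunk_df):
    # build every series of this chunk sequentially inside one worker
    return ts.TimeSeries.from_group_dataframe(
        chunk_df,
        group_cols=grouper,
        value_cols=val,
        time_col="date",
        n_jobs=1,
    )

n_chunks = cpu_count()
# one row per unique group key
groups_df = data_df[grouper].drop_duplicates()
# split row positions instead of the dataframe itself; np.array_split on a
# DataFrame can emit a FutureWarning on recent pandas versions
positions = np.array_split(np.arange(len(groups_df)), n_chunks)
jobs = []
for pos in positions:
    if len(pos) == 0:
        continue  # fewer groups than chunks
    # select the rows of the original dataframe that belong to this chunk's groups
    chunk_df = data_df.merge(groups_df.iloc[pos])
    jobs.append(delayed(process_chunk)(chunk_df))
chunk_results = Parallel(n_jobs=-1)(jobs)
# flatten the per-chunk lists into a single list of series
covariates[key] = sum(chunk_results, start=[])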

On my dataset the latter code is about 4x faster for a dataframe with 30k groups.
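To reproduce such a comparison on another dataset, a small harness along these lines could be used. It is a sketch only: `timed` and `compare` are hypothetical helpers, and checking the number of returned series is just a basic sanity check, not a proof that the two outputs are identical.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare(
    baseline: Callable[[], Any], candidate: Callable[[], Any]
) -> Dict[str, Any]:
    """Time both callables once and report the speedup of candidate over baseline."""
    base_out, base_s = timed(baseline)
    cand_out, cand_s = timed(candidate)
    # guard against a zero reading from the timer on very fast calls
    base_s = max(base_s, 1e-9)
    cand_s = max(cand_s, 1e-9)
    return {
        "same_length": len(base_out) == len(cand_out),
        "baseline_s": base_s,
        "candidate_s": cand_s,
        "speedup": base_s / cand_s,
    }
```

Here `baseline` would wrap the n_jobs=-1 call and `candidate` the chunked version, each as a zero-argument callable. A single run is noisy, so repeating the measurement a few times gives a more reliable ratio.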

I am not certain this is worth putting into the library, but I thought it might be worth looking into.

Jan 19 '25 05:01 tRosenflanz

Hi @tRosenflanz. Thanks a lot for this report and the potential speed-up of from_group_dataframe (and sorry for the late response) 🚀 We have another PR (#2656) in progress that will add support for metadata to TimeSeries and will apply some changes to from_group_dataframe.

After it's merged, we can definitely give this a go. I'll add it to our backlog.

Feb 28 '25 10:02 dennisbader