dython
Parallelize dython.nominal.associations
I am trying out Dython for the first time and as written on the homepage I am using:
from dython.nominal import associations
associations(data)
However this has been running for several minutes now (data is around 300x1700) and from htop it looks like it only uses one of my 10/20 (real/hyperthreading) CPU cores. Is it possible to parallelize this to increase the speed?
Hey @KonradHoeffner - parallelizing would be a great idea, and contributions would be more than welcomed!
@shakedzy: Thanks for the invitation! Unfortunately I have zero experience with both parallelism and library development in Python.
I still cloned the repository to see if there is an easy change to make, and installed both the normal and the dev dependencies. However, the tests failed because dython was not installed. I then installed dython via pip, but I am not sure that is working as intended: wouldn't the tests then test the version installed from pip instead of the local one? It would be helpful to have some more documentation on how to set up and test the project for developers in either README.md or CONTRIBUTING.md, though maybe that is only needed for people like me who are not experienced with Python library development and probably can't contribute much anyway.
Hey @KonradHoeffner - please open a bug report and add all this there, so we don't mix different issues here.
Anyway, the best way to run and test the version you have installed is by cloning the repo, and then from within the dython directory run:
pip install -e .
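For anyone else setting up a dev environment, the full sequence might look like this (a sketch only; the dev-requirements file name is a hypothetical placeholder and may differ in dython's actual repo):

```shell
# Clone the repo and install it in editable mode, so the tests
# run against the local checkout instead of a PyPI release.
git clone https://github.com/shakedzy/dython.git
cd dython
pip install -e .
pip install -r requirements-dev.txt  # hypothetical dev-requirements file name
pytest                               # run the test suite against the local code
```

With an editable install, changes to the cloned source take effect immediately, so the pip-installed copy and the local code never diverge.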
OK, I moved the bug report from here into two new issues https://github.com/shakedzy/dython/issues/119 and https://github.com/shakedzy/dython/issues/118.
I worked on this for one of my professional projects. I broke the association functions down into smaller functions and parallelized them.
It is currently a bit less flexible (no option yet for user-defined functions, but that should be easy to fix), but it allows us to weight all the association/correlation measures. On my machine (an old laptop, 8 cores), it ran 3x faster than the original function.

Would you be interested in another PR, @shakedzy? My approach is different from @mahieyin-rahmun's:
- the components, Theil's U, etc, are the ones being parallelized (parallelization is not done in the association function)
- weighted version of the measures (this can be made vanilla, without weighting the observation)
- using pandas (pandas is fast ^^); the NA strategies using lists are a huge bottleneck
- the overall code is closer to PEP standards and easier to read (flat is better than nested)
- plotting the association matrix would be done using another function (following PEP, smaller functions are preferred)
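To illustrate what component-level parallelization can look like, here is a hypothetical sketch (not @ThomasBury's actual code, and using only the standard library rather than pandas): Cramér's V is computed for each pair of categorical columns, with the pairs distributed across worker processes.

```python
# Hypothetical sketch of component-level parallelism (not dython's actual
# code): Cramér's V for each pair of categorical columns, with the pairs
# distributed across worker processes via the standard library.
import math
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations


def cramers_v(x, y):
    """Cramér's V between two equal-length sequences of categorical labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row, col = Counter(x), Counter(y)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def _pair_job(args):
    # Top-level helper so it can be pickled and sent to worker processes.
    name_i, name_j, xi, xj = args
    return name_i, name_j, cramers_v(xi, xj)


def pairwise_cramers_v(columns, max_workers=None):
    """columns: dict mapping column name -> list of categorical labels."""
    tasks = [(i, j, columns[i], columns[j]) for i, j in combinations(columns, 2)]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return {(i, j): v for i, j, v in pool.map(_pair_job, tasks)}


if __name__ == "__main__":
    cols = {
        "color": ["r", "g", "b", "r", "g", "b"] * 50,
        "shade": ["dark", "light", "dark", "dark", "light", "dark"] * 50,
        "size": ["s", "m", "l", "l", "m", "s"] * 50,
    }
    print(pairwise_cramers_v(cols, max_workers=2))
```

In this sketch the parallelism lives at the measure level, so an `associations`-style wrapper would only need to collect the results into a matrix.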
Hey @ThomasBury - first of all, thanks for your contribution :) I'm wondering here if @mahieyin-rahmun thumbs-up means he's into your code too :)
I think adding your parallelization of the components functions is a good idea no matter what, as these are often used directly, and not through associations.
If there is an easy fix for allowing user-defined functions, then why not give it a shot? Perhaps we can test both your version and @mahieyin-rahmun's version to see which one works best. What do you think?
If you do decide to open a PR, please mind the comments I gave @mahieyin-rahmun, regarding having only one associations method, and enabling multiprocessing using a flag.
Yeah, I think it would be interesting to see both versions, and if a combination of both versions (multiprocessing on top of parallelized component functions) can give an even better performance gain.
Hey, I'm late to the ball. I'm currently refactoring the code when my other duties allow me to do so. I'll try to benchmark the same way. The other features are also working fine. I'll provide an example as an illustration when I open the PR (beginning of next week, normally). That being said, great job 👍
@ThomasBury: I adapted the benchmark code from @mahieyin-rahmun to the parameter renaming and to generic core count:
from dython.nominal import associations
import numpy as np
import pandas as pd
import time
import random
import matplotlib.pyplot as plt
random.seed(42)
np.random.seed(42)
MAX_CORES = 16
RUNS = 5
NUM_COLS = 86
def get_random_categorical_array():
    categorical_data_array = [
        ["A", "B", "C"],
        ["W", "X", "Y", "Z"],
        ["Yes", "No"],
        ["M", "N", "O", "P"]
    ]
    return random.choice(categorical_data_array)

def benchmark_function(func, dataframe, **kwargs):
    times = []
    for run_count in range(RUNS):
        print(f"Run: {run_count}")
        start = time.time()
        results = func(dataframe, compute_only=True, **kwargs)
        end = time.time()
        print(f"Time taken: {(end - start):.2f} seconds")
        times.append(end - start)
    print(
        f"Average time taken by {func}: {(sum(times) / len(times)):.2f} seconds")
    return times
def run_benchmark():
    random_data = np.random.randn(32001, NUM_COLS)
    dataframe = pd.DataFrame(
        data=random_data[1:, 1:],
        index=random_data[1:, 0],
        columns=random_data[0, 1:]
    )
    # 75% of the columns are nominal
    for index in range(int(NUM_COLS * 0.75)):
        dataframe.iloc[:, index] = np.random.choice(
            get_random_categorical_array(), len(dataframe))
    times_vanilla = benchmark_function(
        associations, dataframe, multiprocessing=False)
    times_cores = dict()
    for i in range(2, MAX_CORES + 1, 2):
        times_cores[i] = benchmark_function(
            associations, dataframe, multiprocessing=True, max_cpu_cores=i)
    # plot the results
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), dpi=150)
    axes[0].plot(times_vanilla, label="Vanilla")
    for i in range(2, MAX_CORES + 1, 2):
        axes[0].plot(times_cores[i], label=f"{i} cores")
    axes[0].set_title(f"Time taken by associations function ({RUNS} runs)")
    axes[0].set_xlabel("Run count")
    axes[0].set_ylabel("Time taken (seconds)")
    axes[0].set_xticks(np.arange(len(times_vanilla)),
                       np.arange(1, len(times_vanilla) + 1))
    axes[0].legend()
    bar_chart_data = {
        "Vanilla": np.mean(times_vanilla),
    }
    for i in range(2, MAX_CORES + 1, 2):
        bar_chart_data[f"{i} cores"] = np.mean(times_cores[i])
    axes[1].bar(bar_chart_data.keys(), bar_chart_data.values(), width=0.5)
    axes[1].set_title(
        f"Average time taken by associations function (across {RUNS} runs)")
    axes[1].set_ylabel("Time taken (seconds)")
    plt.tight_layout()
    # save before show(), otherwise the saved figure may be blank
    plt.savefig('benchmark.svg', bbox_inches='tight')
    plt.show()

if __name__ == "__main__":
    run_benchmark()
@ThomasBury & @KonradHoeffner - no worries and no pressure :) whenever you have the time, simply open a new PR.