matplotlib-venn icon indicating copy to clipboard operation
matplotlib-venn copied to clipboard

Design a new layout algorithm

Open konstantint opened this issue 6 years ago • 12 comments

... along with a general system for pluggable layouts. This would address most of issues being reported here regularly (#30, #34).

The main problem with current layout is that it sometimes results in circles positioned so that some of the 7 regions, corresponding to subsets simply is not present. The better layout should make sure each of the regions exists (even if it would misrepresent some of the other general patterns).

There are several approaches for implementing such logic:

  • Applying the current layout, checking whether all the subset regions are present, increasing the weight of the missing regions and repeating until convergence.
  • Formulating the "loss function" of the layout so that it would account both for quality of representation and non-zeroeness of the set areas, and finding the layout using scipy.optimize.

Internally the layout would probably be represented as a function, which, given subset sizes, returns the radii and centers of the circles. Externally the choice of the layout would be specified by the keyword parameter layout passed to the venn3/venn2 function. The value would be either None (default), string ('naive', 'optim', ...) or the function itself.

konstantint avatar Oct 06 '17 13:10 konstantint

Hi, Paul here, you may remember me as the author of matplotlib_venn_wordcloud (issue #29). I had a lazy afternoon and implemented the second approach, i.e. using scipy.optimize. In its current form, it does quite well for "canonical" examples but still fails badly for the edge cases in issues #30 and #34.

Specifically, the cost function likely needs some work. At the moment, it uses the absolute difference, which doesn't work well if some of the areas are much smaller than the others. The options for minimize could also use some experimentation, as I simply re-used some code I had floating around for a vaguely similar task. For these reasons, the code is obviously not PR-ready; however, I am unsure if and when I will have time again, so I thought I better dump it here in case you or anyone else want to pursue this approach further. Licensed under the MIT license or whichever license this repository may adopt in the future.

Figure_1

import numpy as np
import matplotlib.pyplot as plt

from scipy.optimize import minimize
from shapely.geometry import Point

DEBUG = True


def get_radii(a, b, c, ab, bc, ac, abc):
    areas = np.array([
        a + ab + ac + abc,
        b + ab + bc + abc,
        c + ac + bc + abc,
    ])
    return np.sqrt(areas / np.pi)


def get_subset_areas(origins, radii):
    geometries = get_subset_geometries(origins, radii)
    return [geometry.area for geometry in geometries]


def get_subset_geometries(origins, radii):
    a, b, c = [Point(*origin).buffer(radius) for origin, radius in zip(origins, radii)]
    return [
        a.difference(b.union(c)), # A
        b.difference(a.union(c)), # B
        c.difference(b.union(c)), # C
        a.intersection(b).difference(c), # AB
        b.intersection(c).difference(a), # BC
        a.intersection(c).difference(b), # AC
        a.intersection(b).intersection(c), # ABC
    ]


def get_point_on_a_circle(origin, radius, angle):
    """Compute the (x, y) coordinate of a point at a specified angle
    on a circle given by its (x, y) origin and radius."""
    x0, y0 = origin
    x = x0 + radius * np.cos(angle)
    y = y0 + radius * np.sin(angle)
    return np.array([x, y])


def solve_venn3_circles(desired_areas):
    # [A, B, C, AB, BC, AC, ABC]
    desired_areas = np.array(desired_areas)

    radii = get_radii(*desired_areas)

    def cost_function(origins):
        current_areas = get_subset_areas(origins.reshape(-1, 2), radii)
        cost = np.abs(current_areas - desired_areas) # absolute difference; probably not the best choice
        return np.sum(cost)

    # initialize origins such that all rings overlap
    initial_origins = np.array(
        [get_point_on_a_circle(np.zeros((2)), np.min(radii), angle) for angle in np.pi * np.array([0, 2/3, 4/3])]
    )

    result = minimize(
        cost_function,
        initial_origins.flatten(),
        method='SLSQP',
        jac='2-point',
        options=dict(ftol=1e-3, disp=DEBUG),
    )

    if not result.success:
        print("Warning: could not compute circle positions for given subsets.")
        print(f"scipy.optimize.minimize: {result.message}.")

    origins = result.x.reshape((-1, 2))

    if DEBUG:
        print("Desired areas:", desired_areas)
        print("Returned areas:", get_subset_areas(origins, radii))

    return origins, radii


if __name__ == "__main__":

    def test(A, B, C, AB, BC, AC, ABC, ax=None):
        origins, radii = solve_venn3_circles([A, B, C, AB, BC, AC, ABC])

        if ax is None:
            fig, ax = plt.subplots()

        for origin, radius in zip(origins, radii):
            ax.add_patch(plt.Circle(origin, radius, alpha=0.33))
        ax.set_aspect("equal")
        ax.autoscale_view()
        ax.axis("off")
        ax.set_title(f"A={A}, B={B}, C={C},\nAB={AB}, BC={BC}, AC={AC}, ABC={ABC}")

    fig, axes = plt.subplots(2, 3, figsize=(15,10))
    axes = axes.ravel()
    test(1, 1, 1, 0.5, 0.5, 0.5, 0.25, axes[0]) # canonical example
    test(3, 2, 1, 0.5, 0.4, 0.3, 0.2, axes[1]) # different areas
    test(1, 1, 1, 0, 0, 0, 0, axes[2]) # no intersections; not bad, but could be handled better
    test(1, 0, 0, 1, 0, 0, 1, axes[3]) # a is superset to b, b is superset to c
    test(2, 2, 2, 0.5, 0, 0.5, 0, axes[4]) # no BC, and no ABC
    test(2, 2, 2, 0.1, 0.1, 0.1, 0, axes[5]) # no ABC

    # issue #34
    test(167, 7, 25, 41, 174, 171, 51)

    # issue #30
    test(1, 0, 0, 32, 0, 76, 13)
    test(1, 0, 0, 650, 0, 76, 13)
    test(7549417, 15685620, 26018311, 5128906, 301048, 6841264, 2762301)

    plt.show()

paulbrodersen avatar Jan 22 '24 18:01 paulbrodersen

Had an idea for a small improvement on my bike ride home yesterday. This version transforms the desired and actual areas using the relationship y = log(x + 1) before computing the cost as the absolute difference between the (transformed) desired and the (transformed) actual areas. The transformation has the following properties: (1) it is monotonically increasing, i.e. it preserves the order of sizes, (2) it preserves small numbers as x ~ log(x + 1) for small x, but (3) it strongly compresses the range of large numbers. These properties combined let small areas carry a bit more weight in the optimization, resulting in better solutions, at least by eye.

I also improved the layout of the debug info and made some additional cosmetic changes.

import numpy as np

from scipy.optimize import minimize
from shapely.geometry import Point

DEBUG = True


def get_radii(a, b, c, ab, bc, ac, abc):
    areas = np.array([
        a + ab + ac + abc,
        b + ab + bc + abc,
        c + ac + bc + abc,
    ])
    return np.sqrt(areas / np.pi)


def get_subset_areas(origins, radii):
    geometries = get_subset_geometries(origins, radii)
    return np.array([geometry.area for geometry in geometries])


def get_subset_geometries(origins, radii):
    a, b, c = [Point(*origin).buffer(radius) for origin, radius in zip(origins, radii)]
    return [
        a.difference(b.union(c)), # A
        b.difference(a.union(c)), # B
        c.difference(b.union(c)), # C
        a.intersection(b).difference(c), # AB
        b.intersection(c).difference(a), # BC
        a.intersection(c).difference(b), # AC
        a.intersection(b).intersection(c), # ABC
    ]


def get_point_on_a_circle(origin, radius, angle):
    """Compute the (x, y) coordinate of a point at a specified angle
    on a circle given by its (x, y) origin and radius."""
    x0, y0 = origin
    x = x0 + radius * np.cos(angle)
    y = y0 + radius * np.sin(angle)
    return np.array([x, y])


def solve_venn3_circles(desired_areas):
    # [A, B, C, AB, BC, AC, ABC]
    desired_areas = np.array(desired_areas)

    radii = get_radii(*desired_areas)

    def cost_function(origins):
        current_areas = get_subset_areas(origins.reshape(-1, 2), radii)

        # # Option 1: absolute difference
        # # Probably not the best choice, as small areas are often practically ignored.
        # cost = np.abs(current_areas - desired_areas)

        # # Option 2: relative difference
        # # This often results in the optimization routine failing.
        # minimum_area = 1e-2
        # desired_areas[desired_areas < minimum_area] = minimum_area
        # cost = np.abs(current_areas - desired_areas) / desired_areas

        # Option 3: absolute difference of log(area + 1)
        # This transformation is monotonic increasing but strongly compresses large numbers,
        # thus allowing small numbers to carry more weight in the optimization.
        cost = np.abs(np.log(current_areas + 1) - np.log(desired_areas + 1))

        return np.sum(cost)

    # initialize origins such that all rings overlap
    initial_origins = np.array(
        [get_point_on_a_circle(np.zeros((2)), np.min(radii), angle) for angle in np.pi * np.array([0, 2/3, 4/3])]
    )

    result = minimize(
        cost_function,
        initial_origins.flatten(),
        method='SLSQP',
        options=dict(disp=DEBUG),
    )

    if not result.success:
        print("Warning: could not compute circle positions for given subsets.")
        print(f"scipy.optimize.minimize: {result.message}.")

    origins = result.x.reshape((-1, 2))

    if DEBUG:
        import pandas as pd
        data = pd.DataFrame(dict(desired=desired_areas, actual=get_subset_areas(origins, radii)))
        data["absolute difference"] = np.abs(data["desired"] - data["actual"])
        data["relative difference"] = data["absolute difference"] / data["desired"]
        with pd.option_context('display.float_format', '{:0.1f}'.format):
            print(data)

    return origins, radii


if __name__ == "__main__":

    import matplotlib.pyplot as plt

    def test(A, B, C, AB, BC, AC, ABC, ax=None):
        origins, radii = solve_venn3_circles([A, B, C, AB, BC, AC, ABC])

        if ax is None:
            fig, ax = plt.subplots()

        for origin, radius in zip(origins, radii):
            ax.add_patch(plt.Circle(origin, radius, alpha=0.33))
        ax.set_aspect("equal")
        ax.autoscale_view()
        ax.axis("off")
        ax.set_title(f"A={A}, B={B}, C={C},\nAB={AB}, BC={BC}, AC={AC}, ABC={ABC}")

    fig, axes = plt.subplots(2, 3, figsize=(15,10))
    axes = axes.ravel()
    test(1, 1, 1, 0.5, 0.5, 0.5, 0.25, axes[0]) # canonical example
    test(3, 2, 1, 0.5, 0.4, 0.3, 0.2, axes[1]) # different areas
    test(1, 1, 1, 0, 0, 0, 0, axes[2]) # no intersections; not bad, but could be handled better
    test(1, 0, 0, 1, 0, 0, 1, axes[3]) # a is superset to b, b is superset to c
    test(2, 2, 2, 0.5, 0, 0.5, 0, axes[4]) # no BC, and no ABC
    test(2, 2, 2, 0.1, 0.1, 0.1, 0, axes[5]) # no ABC

    # issue #34
    test(167, 7, 25, 41, 174, 171, 51)

    # issue #30
    test(1, 0, 0, 32, 0, 76, 13)
    test(1, 0, 0, 650, 0, 76, 13)
    test(7549417, 15685620, 26018311, 5128906, 301048, 6841264, 2762301)

    plt.show()

paulbrodersen avatar Jan 23 '24 12:01 paulbrodersen

Wow, this is very cool.

Let me try to find time to take it up on the weekend. Besides just the layout implementation this will require:

  • A major version increase.
  • A careful change to the API to allow plugging the new layout algorithm without disrupting any existing usages.
  • A packaging option where Shapely is an opt-in dependency along with clear enough documentation.

konstantint avatar Jan 23 '24 12:01 konstantint

Do have a close look at the code though, too. Four eyes see more than two. For example, I just found a major bug in get_subset_geometries. Also, making a proper test suite with a bunch of examples that evaluates performance along a few metrics (absolute difference, relative difference, etc) would be a good call, I think. Note, that I haven't compared my results to the output of your algorithm at all, yet.

Without the bug, the performance of my optimization is much better. ;-) Issue #30, last test case, gets solved perfectly.

import numpy as np

from scipy.optimize import minimize
from shapely.geometry import Point

DEBUG = True


def get_radii(a, b, c, ab, bc, ac, abc):
    areas = np.array([
        a + ab + ac + abc,
        b + ab + bc + abc,
        c + ac + bc + abc,
    ])
    return np.sqrt(areas / np.pi)


def get_subset_areas(origins, radii):
    geometries = get_subset_geometries(origins, radii)
    return np.array([geometry.area for geometry in geometries])


def get_subset_geometries(origins, radii):
    a, b, c = [get_shapely_circle(origin, radius) for origin, radius in zip(origins, radii)]
    return [
        a.difference(b).difference(c), # A
        b.difference(a).difference(c), # B
        c.difference(a).difference(b), # C
        a.intersection(b).difference(c), # AB
        b.intersection(c).difference(a), # BC
        a.intersection(c).difference(b), # AC
        a.intersection(b).intersection(c), # ABC
    ]


def get_shapely_circle(origin, radius):
    return Point(*origin).buffer(radius)


def get_point_on_a_circle(origin, radius, angle):
    """Compute the (x, y) coordinate of a point at a specified angle
    on a circle given by its (x, y) origin and radius."""
    x0, y0 = origin
    x = x0 + radius * np.cos(angle)
    y = y0 + radius * np.sin(angle)
    return np.array([x, y])


def solve_venn3_circles(desired_areas):
    # [A, B, C, AB, BC, AC, ABC]
    desired_areas = np.array(desired_areas)

    radii = get_radii(*desired_areas)

    def cost_function(origins):
        current_areas = get_subset_areas(origins.reshape(-1, 2), radii)

        # # Option 1: absolute difference
        # # Probably not the best choice, as small areas are often practically ignored.
        # cost = np.abs(current_areas - desired_areas)

        # Option 2: relative difference
        # This often results in the optimization routine failing.
        # minimum_area = 1e-2
        # desired_areas[desired_areas < minimum_area] = minimum_area
        # cost = np.abs(current_areas - desired_areas) / desired_areas

        # Option 3: absolute difference of log(area + 1)
        # This transformation is monotonic increasing but strongly compresses large numbers,
        # thus allowing small numbers to carry more weight in the optimization.
        cost = np.abs(np.log(current_areas + 1) - np.log(desired_areas + 1))

        return np.sum(cost)

    # initialize origins such that all rings overlap
    initial_origins = np.array(
        [get_point_on_a_circle(np.zeros((2)), np.min(radii), angle) for angle in np.pi * np.array([0, 2/3, 4/3])]
    )

    result = minimize(
        cost_function,
        initial_origins.flatten(),
        method='SLSQP',
        options=dict(disp=DEBUG, ftol=1e-6),
    )

    if not result.success:
        print("Warning: could not compute circle positions for given subsets.")
        print(f"scipy.optimize.minimize: {result.message}.")

    origins = result.x.reshape((-1, 2))

    if DEBUG:
        import pandas as pd
        data = pd.DataFrame(dict(desired=desired_areas, actual=get_subset_areas(origins, radii)))
        data["absolute difference"] = np.abs(data["desired"] - data["actual"])
        data["relative difference"] = data["absolute difference"] / data["desired"]
        with pd.option_context('display.float_format', '{:0.1f}'.format):
            print(data)

    return origins, radii


if __name__ == "__main__":

    import matplotlib.pyplot as plt

    def test(A, B, C, AB, BC, AC, ABC, ax=None):
        origins, radii = solve_venn3_circles([A, B, C, AB, BC, AC, ABC])

        if ax is None:
            fig, ax = plt.subplots()

        for origin, radius in zip(origins, radii):
            ax.add_patch(plt.Circle(origin, radius, alpha=0.33))
        ax.set_aspect("equal")
        ax.autoscale_view()
        ax.axis("off")
        ax.set_title(f"A={A}, B={B}, C={C},\nAB={AB}, BC={BC}, AC={AC}, ABC={ABC}")

    fig, axes = plt.subplots(2, 3, figsize=(15,10))
    axes = axes.ravel()
    test(1, 1, 1, 0.5, 0.5, 0.5, 0.25, axes[0]) # canonical example
    test(3, 2, 1, 0.5, 0.4, 0.3, 0.2, axes[1]) # different areas
    test(1, 1, 1, 0, 0, 0, 0, axes[2]) # no intersections; not bad, but could be handled better
    test(1, 0, 0, 1, 0, 0, 1, axes[3]) # a is superset to b, b is superset to c
    test(2, 2, 2, 0.5, 0, 0.5, 0, axes[4]) # no BC, and no ABC
    test(2, 2, 2, 0.1, 0.1, 0.1, 0, axes[5]) # no ABC

    # issue #30
    test(1, 0, 0, 32, 0, 76, 13)
    test(1, 0, 0, 650, 0, 76, 13)
    test(7549417, 15685620, 26018311, 5128906, 301048, 6841264, 2762301) # works well with latest version

    # issue #34
    test(167, 7, 25, 41, 174, 171, 51)

    plt.show()

paulbrodersen avatar Jan 23 '24 12:01 paulbrodersen

Obviously I won't just copy-paste your code somewhere :)

konstantint avatar Jan 23 '24 12:01 konstantint

Regarding dependencies:

A packaging option where Shapely and Scipy are opt-in dependencies along with clear enough documentation.

There is no way around scipy.optimize, but shapely is only used to compute the subset areas. I assumed that you were doing that somewhere within your code base, but I couldn't find it within a hot minute, so I decided to roll my own using shapely (I did mention I was feeling lazy yesterday). Even if there isn't anything in your code base, yet, computing circle areas sounds like something that has closed form solutions that should be easily implemented with a bit of numpy arithmetic.

paulbrodersen avatar Jan 23 '24 12:01 paulbrodersen

Also, eventually, it might be worth posting a question on SO, possibly with a bounty (happy to chip in), to see if anyone can find a better a cost function or minimization procedure. It's the right sort of problem (self-contained yet of practical use) that might attract the right people.

paulbrodersen avatar Jan 23 '24 12:01 paulbrodersen

I added some constraints to the optimisation, which did make the results more visually pleasing. I also played around with the initialisation, albeit with no success.

Figure_1

import numpy as np

from scipy.spatial.distance import cdist, pdist, squareform
from scipy.optimize import minimize, NonlinearConstraint

from shapely.geometry import Point

DEBUG = True


def get_radii(a, b, c, ab, bc, ac, abc):
    areas = np.array([
        a + ab + ac + abc,
        b + ab + bc + abc,
        c + ac + bc + abc,
    ])
    return np.sqrt(areas / np.pi)


def get_subset_areas(origins, radii):
    geometries = get_subset_geometries(origins, radii)
    return np.array([geometry.area for geometry in geometries])


def get_subset_geometries(origins, radii):
    a, b, c = [get_shapely_circle(origin, radius) for origin, radius in zip(origins, radii)]
    return [
        a.difference(b).difference(c), # A
        b.difference(a).difference(c), # B
        c.difference(a).difference(b), # C
        a.intersection(b).difference(c), # AB
        b.intersection(c).difference(a), # BC
        a.intersection(c).difference(b), # AC
        a.intersection(b).intersection(c), # ABC
    ]


def get_shapely_circle(origin, radius):
    return Point(*origin).buffer(radius)


def initialize_origins(radii):
    """Initialize origins on a small circle around (0, 0)."""
    origin = np.zeros((2))
    radius = np.min(radii)
    angles = np.pi * np.array([0, 2/3, 4/3])
    return np.array(
        [get_point_on_a_circle(origin, radius, angle) for angle in angles]
    )


def get_point_on_a_circle(origin, radius, angle):
    """Compute the (x, y) coordinate of a point at a specified angle
    on a circle given by its (x, y) origin and radius."""
    x0, y0 = origin
    x = x0 + radius * np.cos(angle)
    y = y0 + radius * np.sin(angle)
    return np.array([x, y])


# # Doesn't work as well as I had hoped.
# def initialize_origins(radii):
#     """Initialize circle origins such that all circles (pairwise) overlap.

#     The optimization uses gradient descent. If certain subset areas
#     are zero, there is no gradient to follow. This procedure
#     guarantees the existence of all subsections apart from region ABC.
#     However, in all examples tested, ABC did indeed exist as well.

#     """
#     ra, rb, rc = radii

#     # choose the side lengths such that circles overlap, but not fully
#     ab = ra + rb - min(ra, rb) / 2
#     bc = rb + rc - min(rb, rc) / 2
#     ac = ra + rc - min(ra, rc) / 2

#     # find corner coordinates
#     # https://math.stackexchange.com/a/1989113/439191
#     a = np.array([0, 0])
#     b = np.array([0, ab])
#     cy = (ab**2 + ac**2 - bc**2) / (2 * ab)
#     cx = np.sqrt(ac**2 - cy**2)
#     c = np.array([cx, cy])
#     return np.array([a, b, c])


def solve_venn3_circles(desired_areas):
    # [A, B, C, AB, BC, AC, ABC]
    desired_areas = np.array(desired_areas)

    radii = get_radii(*desired_areas)

    def cost_function(origins):
        current_areas = get_subset_areas(origins.reshape(-1, 2), radii)

        # # Option 1: absolute difference
        # # Probably not the best choice, as small areas are often practically ignored.
        # cost = np.abs(current_areas - desired_areas)

        # # Option 2: relative difference
        # # This often results in the optimization routine failing.
        # minimum_area = 1e-2
        # desired_areas[desired_areas < minimum_area] = minimum_area
        # cost = np.abs(current_areas - desired_areas) / desired_areas

        # Option 3: absolute difference of log(area + 1)
        # This transformation is monotonic increasing but strongly compresses large numbers,
        # thus allowing small numbers to carry more weight in the optimization.
        cost = np.abs(np.log(current_areas + 1) - np.log(desired_areas + 1))

        return np.sum(cost)

    # constraints:
    eps = np.min(radii) * 0.01
    lower_bounds = np.abs(radii[np.newaxis, :] - radii[:, np.newaxis]) - eps
    lower_bounds[lower_bounds < 0] = 0
    lower_bounds = squareform(lower_bounds)

    upper_bounds = radii[np.newaxis, :] + radii[:, np.newaxis] + eps
    upper_bounds -= np.diag(np.diag(upper_bounds)) # squareform requires zeros on diagonal
    upper_bounds = squareform(upper_bounds)

    def constraint_function(origins):
        origins = np.reshape(origins, (-1, 2))
        return pdist(origins)

    distance_between_origins = NonlinearConstraint(
        constraint_function, lb=lower_bounds, ub=upper_bounds)

    result = minimize(
        cost_function,
        initialize_origins(radii).flatten(),
        method='SLSQP',# 'COBYLA',
        constraints=[distance_between_origins],
        options=dict(disp=DEBUG)
    )

    if not result.success:
        print("Warning: could not compute circle positions for given subsets.")
        print(f"scipy.optimize.minimize: {result.message}.")

    origins = result.x.reshape((-1, 2))

    if DEBUG:
        import pandas as pd
        data = pd.DataFrame(dict(desired=desired_areas, actual=get_subset_areas(origins, radii)))
        data["absolute difference"] = np.abs(data["desired"] - data["actual"])
        data["relative difference"] = data["absolute difference"] / data["desired"]
        with pd.option_context('display.float_format', '{:0.1f}'.format):
            print(data)

    return origins, radii


if __name__ == "__main__":

    import matplotlib.pyplot as plt

    def test(A, B, C, AB, BC, AC, ABC, ax=None):
        origins, radii = solve_venn3_circles([A, B, C, AB, BC, AC, ABC])

        if ax is None:
            fig, ax = plt.subplots()

        for origin, radius in zip(origins, radii):
            ax.add_patch(plt.Circle(origin, radius, alpha=0.33))
        ax.set_aspect("equal")
        ax.autoscale_view()
        ax.axis("off")
        ax.set_title(f"A={A}, B={B}, C={C},\nAB={AB}, BC={BC}, AC={AC}, ABC={ABC}")

    fig, axes = plt.subplots(2, 3, figsize=(15,10))
    axes = axes.ravel()
    test(1, 1, 1, 0.5, 0.5, 0.5, 0.25, axes[0]) # canonical example
    test(3, 2, 1, 0.5, 0.4, 0.3, 0.2, axes[1]) # different areas
    test(1, 1, 1, 0, 0, 0, 0, axes[2]) # no intersections; not bad, but could be handled better
    test(1, 0, 0, 1, 0, 0, 1, axes[3]) # a is superset to b, b is superset to c
    test(2, 2, 2, 0.5, 0, 0.5, 0, axes[4]) # no BC, and no ABC
    test(2, 2, 2, 0.1, 0.1, 0.1, 0, axes[5]) # no ABC

    # # issue #30
    # test(1, 0, 0, 32, 0, 76, 13)
    # test(1, 0, 0, 650, 0, 76, 13)
    # test(7549417, 15685620, 26018311, 5128906, 301048, 6841264, 2762301) # works well with latest version

    # # issue #34
    # test(167, 7, 25, 41, 174, 171, 51)

    plt.show()

paulbrodersen avatar Jan 23 '24 16:01 paulbrodersen

An argument in favour of making shapely a core dependency would be that is has an implementation of polylabel, which can be used to find the point of inaccessibility. I think that would improve the label placement.

Taking the example from issue #50 :

Figure_1

import numpy as np

from scipy.spatial.distance import cdist, pdist, squareform
from scipy.optimize import minimize, NonlinearConstraint

from shapely.geometry import Point
from shapely.ops import polylabel

DEBUG = True


def get_radii(a, b, c, ab, bc, ac, abc):
    areas = np.array([
        a + ab + ac + abc,
        b + ab + bc + abc,
        c + ac + bc + abc,
    ])
    return np.sqrt(areas / np.pi)


def get_subset_areas(origins, radii):
    geometries = get_subset_geometries(origins, radii)
    return np.array([geometry.area for geometry in geometries])


def get_subset_geometries(origins, radii):
    a, b, c = [get_shapely_circle(origin, radius) for origin, radius in zip(origins, radii)]
    return [
        a.difference(b).difference(c), # A
        b.difference(a).difference(c), # B
        c.difference(a).difference(b), # C
        a.intersection(b).difference(c), # AB
        b.intersection(c).difference(a), # BC
        a.intersection(c).difference(b), # AC
        a.intersection(b).intersection(c), # ABC
    ]


def get_shapely_circle(origin, radius):
    return Point(*origin).buffer(radius)


def initialize_origins(radii):
    """Initialize origins on a small circle around (0, 0)."""
    origin = np.zeros((2))
    radius = np.min(radii)
    angles = np.pi * np.array([0, 2/3, 4/3])
    return np.array(
        [get_point_on_a_circle(origin, radius, angle) for angle in angles]
    )


def get_point_on_a_circle(origin, radius, angle):
    """Compute the (x, y) coordinate of a point at a specified angle
    on a circle given by its (x, y) origin and radius."""
    x0, y0 = origin
    x = x0 + radius * np.cos(angle)
    y = y0 + radius * np.sin(angle)
    return np.array([x, y])


# # Doesn't work as well as I had hoped.
# def initialize_origins(radii):
#     """Initialize circle origins such that all circles (pairwise) overlap.

#     The optimization uses gradient descent. If certain subset areas
#     are zero, there is no gradient to follow. This procedure
#     guarantees the existence of all subsections apart from region ABC.
#     However, in all examples tested, ABC did indeed exist as well.

#     """
#     ra, rb, rc = radii

#     # choose the side lengths such that circles overlap, but not fully
#     ab = ra + rb - min(ra, rb) / 2
#     bc = rb + rc - min(rb, rc) / 2
#     ac = ra + rc - min(ra, rc) / 2

#     # find corner coordinates
#     # https://math.stackexchange.com/a/1989113/439191
#     a = np.array([0, 0])
#     b = np.array([0, ab])
#     cy = (ab**2 + ac**2 - bc**2) / (2 * ab)
#     cx = np.sqrt(ac**2 - cy**2)
#     c = np.array([cx, cy])
#     return np.array([a, b, c])


def solve_venn3_circles(desired_areas):
    # [A, B, C, AB, BC, AC, ABC]
    desired_areas = np.array(desired_areas)

    radii = get_radii(*desired_areas)

    def cost_function(origins):
        current_areas = get_subset_areas(origins.reshape(-1, 2), radii)

        # # Option 1: absolute difference
        # # Probably not the best choice, as small areas are often practically ignored.
        # cost = np.abs(current_areas - desired_areas)

        # # Option 2: relative difference
        # # This often results in the optimization routine failing.
        # minimum_area = 1e-2
        # desired_areas[desired_areas < minimum_area] = minimum_area
        # cost = np.abs(current_areas - desired_areas) / desired_areas

        # Option 3: absolute difference of log(area + 1)
        # This transformation is monotonic increasing but strongly compresses large numbers,
        # thus allowing small numbers to carry more weight in the optimization.
        cost = np.abs(np.log(current_areas + 1) - np.log(desired_areas + 1))

        return np.sum(cost)

    # constraints:
    eps = np.min(radii) * 0.01
    lower_bounds = np.abs(radii[np.newaxis, :] - radii[:, np.newaxis]) - eps
    lower_bounds[lower_bounds < 0] = 0
    lower_bounds = squareform(lower_bounds)

    upper_bounds = radii[np.newaxis, :] + radii[:, np.newaxis] + eps
    upper_bounds -= np.diag(np.diag(upper_bounds)) # squareform requires zeros on diagonal
    upper_bounds = squareform(upper_bounds)

    def constraint_function(origins):
        origins = np.reshape(origins, (-1, 2))
        return pdist(origins)

    distance_between_origins = NonlinearConstraint(
        constraint_function, lb=lower_bounds, ub=upper_bounds)

    result = minimize(
        cost_function,
        initialize_origins(radii).flatten(),
        method='SLSQP',# 'COBYLA',
        constraints=[distance_between_origins],
        options=dict(disp=DEBUG)
    )

    if not result.success:
        print("Warning: could not compute circle positions for given subsets.")
        print(f"scipy.optimize.minimize: {result.message}.")

    origins = result.x.reshape((-1, 2))

    if DEBUG:
        import pandas as pd
        data = pd.DataFrame(dict(desired=desired_areas, actual=get_subset_areas(origins, radii)))
        data["absolute difference"] = np.abs(data["desired"] - data["actual"])
        data["relative difference"] = data["absolute difference"] / data["desired"]
        with pd.option_context('display.float_format', '{:0.1f}'.format):
            print(data)

    return origins, radii


def get_label_positions(origins, radii, labels):
    "For each non-zero area, find the point of inaccesibility."
    geometries = get_subset_geometries(origins, radii)
    output = list()
    for ii, (label, geometry) in enumerate(zip(labels, geometries)):
        if geometry.area > 0:
            poi = polylabel(geometry)
            output.append((poi.x, poi.y, label))
    return output


if __name__ == "__main__":

    import matplotlib.pyplot as plt

    def test(A, B, C, AB, BC, AC, ABC, ax=None):
        origins, radii = solve_venn3_circles([A, B, C, AB, BC, AC, ABC])

        if ax is None:
            fig, ax = plt.subplots()

        for origin, radius in zip(origins, radii):
            ax.add_patch(plt.Circle(origin, radius, alpha=0.33))
        ax.set_aspect("equal")
        ax.autoscale_view()
        ax.axis("off")
        # ax.set_title(f"A={A}, B={B}, C={C},\nAB={AB}, BC={BC}, AC={AC}, ABC={ABC}")

        label_positions = get_label_positions(origins, radii, [A, B, C, AB, BC, AC, ABC])
        for (x, y, label) in label_positions:
            ax.text(x, y, label, va="center", ha="center")

    # fig, axes = plt.subplots(2, 3, figsize=(15,10))
    # axes = axes.ravel()
    # test(1, 1, 1, 0.5, 0.5, 0.5, 0.25, axes[0]) # canonical example
    # test(3, 2, 1, 0.5, 0.4, 0.3, 0.2, axes[1]) # different areas
    # test(1, 1, 1, 0, 0, 0, 0, axes[2]) # no intersections; not bad, but could be handled better
    # test(1, 0, 0, 1, 0, 0, 1, axes[3]) # a is superset to b, b is superset to c
    # test(2, 2, 2, 0.5, 0, 0.5, 0, axes[4]) # no BC, and no ABC
    # test(2, 2, 2, 0.1, 0.1, 0.1, 0, axes[5]) # no ABC

    # # issue #30
    # test(1, 0, 0, 32, 0, 76, 13)
    # test(1, 0, 0, 650, 0, 76, 13)
    # test(7549417, 15685620, 26018311, 5128906, 301048, 6841264, 2762301) # works well with latest version

    # # issue #34
    # test(167, 7, 25, 41, 174, 171, 51)

    # issue # 50
    test(1808, 1181, 1858, 4715, 3031, 26482, 65012)
    plt.show()

paulbrodersen avatar Jan 23 '24 17:01 paulbrodersen

So, I haven't found enough time this weekend to look at the particular algorithm proposals above. What I did manage, though, is to refactor the code and expose the layout algorithm for customization.

After the latest commit it is possible to do the following:

  • Create a custom VennLayoutAlgorithm implementation.
  • Pass it as layout_algorithm=custom_algorithm to venn2/venn3/venn2_circles/venn3_circles.

Could you try packaging the best of your experiments above into such a class to make it easy to try out? The current architecture lends reasonably well to inclusion of such experimental layout methods (perhaps under an appropriately-named package) into the codebase without breaking any normal usages.

konstantint avatar Jan 29 '24 23:01 konstantint

Sorry, I had a cold and am behind on work so I won't have time for additional contributions right now. Also, I think it is pretty important that you run some tests first, to see if this version performs better than your original implementation. Until that is for certain, it doesn't really make sense to spend more time on this code.

Furthermore, if it does indeed perform better, you may want to think about some further refactoring. The code should pretty easily generalize to arbitrary Venn diagrams (2, 3, 4, etc), so you may want to make more significant changes to the API.

paulbrodersen avatar Feb 01 '24 11:02 paulbrodersen

Well, we're not in a hurry here, and as you might notice I myself also find less time for this project than it might deserve :)

I will certainly keep looking at it as I find the time gradually. Just made sure there are interfaces where you could also contribute constructively if you happen to get the time.

pretty easily generalize to arbitrary Venn diagrams (2, 3, 4, etc), so you may want to make more significant changes to the API.

Yes, I saw that and I do think that it would be nice to have a vennx function for this sort of arbitrariness. The only change the API might need is to use Dict[code, size] as the primary representation of region sizes rather than the current Tuple[size, size, size, ...].

konstantint avatar Feb 01 '24 14:02 konstantint