
Feature suggestion: Color gradient

Open piernikowyludek opened this issue 1 year ago • 2 comments

I'm currently making scatter plots and I would like my data to be colored according to the probability scores (0.0-1.0). Clusterfun is interpreting my "probability" column as many distinct classes, producing a very colorful but not awfully helpful graph.

The desired behavior would be for clusterfun to realise that the "color" column holds a range of values rather than distinct classes, and to color the points along a gradient (e.g. cmap = plt.get_cmap(colormap, num_bins) from matplotlib).

Here's a function that can help figure out if we're dealing with distinct categories or a range of values:

```python
import numpy as np


def categorical_or_distribution(arr, threshold_ratio=0.2):
    """
    Determine whether the given array represents distinct categories or a distribution.

    Parameters:
    - arr (list or numpy array): Array to classify.
    - threshold_ratio (float): Maximum ratio of unique values to total values
      for the data to count as categorical.

    Returns:
    - str: "categorical" if the data represents distinct categories, "distribution" otherwise.
    """
    # Convert input to a numpy array if not already
    arr = np.array(arr)

    # Get the number of unique values and the total count
    unique_values, counts = np.unique(arr, return_counts=True)
    num_unique = len(unique_values)
    total_values = len(arr)

    # Calculate the ratio of unique values to total values
    unique_ratio = num_unique / total_values

    # Determine if the data is categorical or a distribution
    if unique_ratio <= threshold_ratio:
        return "categorical"
    else:
        return "distribution"

    # Further analysis using entropy (optional)
    # I don't really want to add scipy as a dependency tho
    # value_entropy = entropy(counts)

    # Check if the entropy suggests a distribution (high entropy)
    # if value_entropy > np.log2(num_unique):
    #     return "distribution"
    # else:
    #     return "categorical"
```

And here's a simple function that maps binned values to the "coolwarm" color scheme:

```python
import matplotlib.pyplot as plt
import numpy as np


def data_to_color_mapping(data, colormap='coolwarm', num_bins=256):
    """
    Assigns colors to data points based on a gradient colormap.

    Parameters:
    - data (list or numpy array): Array of data points to color.
    - colormap (str): Name of the colormap to use.
    - num_bins (int): Number of bins (shades) to use in the colormap.

    Returns:
    - dict: Mapping of data point to its corresponding color.
    """
    # Convert data to a numpy array
    data = np.array(data)

    # Normalize data to the range [0, 1]
    data_min, data_max = np.min(data), np.max(data)
    value_range = data_max - data_min
    if value_range == 0:
        # All values are equal: avoid dividing by zero, use the colormap midpoint
        normalized_data = np.full(data.shape, 0.5)
    else:
        normalized_data = (data - data_min) / value_range

    # Create a colormap object
    cmap = plt.get_cmap(colormap, num_bins)

    # Apply colormap to the normalized data
    colors = [rgba_to_hex(cmap(value)) for value in normalized_data]

    # Create a mapping of data point to color
    data_color_map = {data_point: color for data_point, color in zip(data, colors)}

    return data_color_map


def rgba_to_hex(rgba):
    """
    Converts RGBA color to HEX (the alpha channel is dropped).

    Parameters:
    - rgba (tuple): RGBA color as a tuple of floats in [0, 1].

    Returns:
    - str: Color as a HEX string.
    """
    return '#{:02x}{:02x}{:02x}'.format(int(rgba[0] * 255), int(rgba[1] * 255), int(rgba[2] * 255))
```

Currently clusterfun expects to work with classes - the function clusterfun.storage.local.data.get_data_per_color() was built around that assumption, so it's not obvious how to integrate a color range without changing this function.

Do you see an easy way to add this functionality?

piernikowyludek, May 08 '24 12:05

Hi @piernikowyludek, thanks for the suggestion, that's definitely a useful one. clusterfun uses Plotly.js for rendering the plot in the frontend, and I think it should have some options for this. I won't have time today but can have a look tomorrow to see if I can add it.

gietema, May 08 '24 15:05

@piernikowyludek I added a color_is_categorical boolean to the histogram, bar chart and scatter options.

I thought about determining this automatically, similar to what you suggested, but I think it is quite problem dependent - as also highlighted here: https://stackoverflow.com/a/54801198.
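To illustrate why automatic detection is problem dependent, here is a minimal sketch using only the standard library. It re-implements the unique-ratio idea from the suggestion above under a hypothetical name, `unique_ratio_guess`, and the two sample datasets are made up for illustration. It is not how clusterfun decides anything.

```python
def unique_ratio_guess(values, threshold_ratio=0.2):
    """Guess 'categorical' or 'distribution' from the share of unique values.

    Hypothetical helper for illustration only, not part of clusterfun.
    """
    if not values:
        raise ValueError("values must not be empty")
    unique_ratio = len(set(values)) / len(values)
    return "categorical" if unique_ratio <= threshold_ratio else "distribution"


# Case 1: ten rows with five class labels. The ratio is 5 / 10 = 0.5,
# so the heuristic treats genuine class labels as a distribution.
small_labels = ["cat", "dog", "bird", "fish", "frog"] * 2

# Case 2: 1,000 probability scores rounded to one decimal place.
# Only 11 distinct values exist (0.0 to 1.0), so the ratio is 11 / 1000
# and the heuristic treats continuous scores as categories.
rounded_scores = [round((i % 11) / 10, 1) for i in range(1000)]

small_guess = unique_ratio_guess(small_labels)
scores_guess = unique_ratio_guess(rounded_scores)
print(small_guess)
print(scores_guess)
```

Both guesses come out the wrong way round for these inputs, which is the kind of case an explicit flag such as color_is_categorical sidesteps.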

For now, I'm defaulting to the Viridis color scale. I might add an option to set this if you think it would be useful (see all color scales here).

I released a new version, 0.3.1a7, which should include color_is_categorical.

I hope this helps and that I understood correctly what you wanted. If not, let me know and I can try to change it.

See also https://github.com/gietema/clusterfun/releases/tag/v0.3.1 for an example

gietema, May 09 '24 22:05