Feature suggestion: Color gradient
I'm currently making scatter plots and would like my data points to be colored according to their probability scores (0.0-1.0). Clusterfun interprets my "probability" column as many distinct classes, which produces a very colorful but not very helpful graph.
The desired behavior would be for clusterfun to recognise that the "color" column contains a range of values rather than distinct classes, and to color the points accordingly (e.g. cmap = plt.get_cmap(colormap, num_bins) from matplotlib).
Here's a function that can help figure out if we're dealing with distinct categories or a range of values:
```python
import numpy as np


def categorical_or_distribution(arr, threshold_ratio=0.2):
    """
    Determine whether the given array represents distinct categories or a distribution.

    Parameters:
    - arr (list or numpy array): Array to classify.
    - threshold_ratio (float): Maximum ratio of unique values to total values
      for the data to still count as categorical.

    Returns:
    - str: "categorical" if the data represents distinct categories, "distribution" otherwise.
    """
    # Convert input to a numpy array if not already
    arr = np.asarray(arr)

    # Get the unique values and the total count
    # (counts is only needed for the optional entropy check below)
    unique_values, counts = np.unique(arr, return_counts=True)
    num_unique = len(unique_values)
    total_values = len(arr)

    # Calculate the ratio of unique values to total values
    unique_ratio = num_unique / total_values

    # Few unique values relative to the sample size suggests categories
    if unique_ratio <= threshold_ratio:
        return "categorical"
    return "distribution"

    # Further analysis using entropy (optional, never reached as written)
    # I don't really want to add scipy as a dependency tho
    # value_entropy = entropy(counts)
    # Check if the entropy suggests a distribution (high entropy)
    # if value_entropy > np.log2(num_unique):
    #     return "distribution"
    # else:
    #     return "categorical"
```
And here's a simple function that can map binned values to the "coolwarm" color scheme:
```python
import matplotlib.pyplot as plt
import numpy as np


def data_to_color_mapping(data, colormap='coolwarm', num_bins=256):
    """
    Assigns colors to data points based on a gradient colormap.

    Parameters:
    - data (list or numpy array): Array of data points to color.
    - colormap (str): Name of the colormap to use.
    - num_bins (int): Number of bins (shades) to use in the colormap.

    Returns:
    - dict: Mapping of data point to its corresponding color.
    """
    # Convert data to a float numpy array
    data = np.asarray(data, dtype=float)

    # Normalize data to the range [0, 1]
    data_min, data_max = np.min(data), np.max(data)
    value_range = data_max - data_min
    if value_range == 0:
        # All values are identical: avoid dividing by zero
        normalized_data = np.zeros_like(data)
    else:
        normalized_data = (data - data_min) / value_range

    # Create a colormap object with num_bins discrete shades
    cmap = plt.get_cmap(colormap, num_bins)

    # Apply colormap to the normalized data
    colors = [rgba_to_hex(cmap(value)) for value in normalized_data]

    # Create a mapping of data point to color (duplicate values share one entry)
    data_color_map = {data_point: color for data_point, color in zip(data, colors)}
    return data_color_map


def rgba_to_hex(rgba):
    """
    Converts RGBA color to HEX (the alpha channel is dropped).

    Parameters:
    - rgba (tuple): RGBA color as a tuple of floats in [0, 1].

    Returns:
    - str: Color as a HEX string.
    """
    return '#{:02x}{:02x}{:02x}'.format(int(rgba[0] * 255), int(rgba[1] * 255), int(rgba[2] * 255))
```
Currently clusterfun expects to work with classes - the function clusterfun.storage.local.data.get_data_per_color() was built around that assumption, so it's not obvious how to integrate a color range without changing this function.
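One possible workaround, sketched here under assumptions rather than taken from clusterfun itself, is to discretise the continuous values into equal-width bins and treat each bin as its own "class". A class-based grouping step could then stay unchanged. The helper name `bin_values_as_classes` and its `num_bins` parameter are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def bin_values_as_classes(values, num_bins=10):
    """Group indices of `values` by equal-width bin so each bin can act as one class."""
    lo, hi = min(values), max(values)
    span = hi - lo
    groups = defaultdict(list)
    for index, value in enumerate(values):
        if span == 0:
            # All values are identical: everything falls into a single bin.
            bin_index = 0
        else:
            # Scale to [0, num_bins] and clamp the maximum into the last bin.
            bin_index = min(int((value - lo) / span * num_bins), num_bins - 1)
        groups[bin_index].append(index)
    return dict(groups)


probabilities = [0.0, 0.15, 0.48, 0.55, 0.97, 1.0]
groups = bin_values_as_classes(probabilities, num_bins=4)
print(groups)
```

Each bin index could then be paired with one shade of a gradient, at the cost of losing resolution inside a bin.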
Do you see an easy way to add this functionality?
Hi @piernikowyludek, thanks for the suggestion, that's definitely a useful one. clusterfun uses Plotly.js for rendering the plot in the frontend and I think they should have some options for this. I won't have time today but can have a look tomorrow to see if I can add it.
@piernikowyludek I added a color_is_categorical boolean to the histogram, bar chart and scatter options.
I thought about determining this automatically, similar to what you suggested, but I think it is quite problem dependent - as also highlighted here: https://stackoverflow.com/a/54801198.
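To illustrate why automatic detection is problem dependent, here is a small sketch that re-implements the unique-ratio heuristic from earlier in this thread in plain Python and feeds it two made-up datasets. The function name `unique_ratio_label` and both datasets are assumptions for illustration only.

```python
def unique_ratio_label(values, threshold_ratio=0.2):
    """Plain-Python version of the unique-ratio heuristic discussed above."""
    unique_ratio = len(set(values)) / len(values)
    return "categorical" if unique_ratio <= threshold_ratio else "distribution"


# Probability scores rounded to one decimal: continuous in meaning,
# but only 11 unique values across 1000 samples.
rounded_scores = [round((i % 11) / 10, 1) for i in range(1000)]

# Ten samples that each belong to a different class: categorical in meaning,
# but every value is unique.
class_ids = list(range(10))

rounded_label = unique_ratio_label(rounded_scores)
class_label = unique_ratio_label(class_ids)
print(rounded_label)  # categorical
print(class_label)    # distribution
```

Both labels are the opposite of what the user would want, which is the kind of case an explicit flag avoids.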
For now, I'm defaulting to the Viridis color scale. I might add an option to set this if you think it would be useful (see all color scales here).
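As a rough picture of what a continuous color scale does with a probability score, the sketch below interpolates linearly between three color stops. It is not how Plotly.js or clusterfun implement Viridis: the three RGB stops are approximate Viridis anchor colors chosen for illustration, and `value_to_hex` is a hypothetical helper.

```python
# Approximate Viridis anchor colors at 0.0, 0.5 and 1.0 (assumed for illustration).
VIRIDIS_STOPS = [
    (0.0, (68, 1, 84)),
    (0.5, (33, 145, 140)),
    (1.0, (253, 231, 37)),
]


def value_to_hex(value, vmin=0.0, vmax=1.0, stops=VIRIDIS_STOPS):
    """Map a number onto a hex color by linear interpolation between color stops."""
    if vmax == vmin:
        t = 0.0
    else:
        # Normalize to [0, 1] and clamp out-of-range values.
        t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    for (t0, c0), (t1, c1) in zip(stops, stops[1:]):
        if t <= t1:
            weight = (t - t0) / (t1 - t0)
            rgb = [round(a + (b - a) * weight) for a, b in zip(c0, c1)]
            return "#{:02x}{:02x}{:02x}".format(*rgb)
    return "#{:02x}{:02x}{:02x}".format(*stops[-1][1])


for score in (0.0, 0.5, 1.0):
    print(score, value_to_hex(score))
```

A real color scale uses many more stops, so intermediate colors from this sketch will only roughly match Viridis.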
I released a new version, 0.3.1a7, and the color_is_categorical option should be included there.
I hope this helps and that I understood correctly what you wanted. If not, let me know and I can try to change it.
See also https://github.com/gietema/clusterfun/releases/tag/v0.3.1 for an example