CategoricalArrays.jl
Plotting categorical values as colors
I have plot recipes that try to plot the categorical values as colors in a geographic map. For example, the crop type in this plot: https://juliaearth.github.io/GeoStats.jl/stable/workflow.html#Plotting-solutions
I was doing some manual pre-processing by depending on CategoricalArrays.jl and converting to a vector of level codes manually. Given that the latest CategoricalArrays.jl already supports plot recipes, I wanted to stop doing this manual fix. What is the appropriate method to pass categorical arrays to be interpreted as colors with an appropriate legend containing the levels?
I tried to pass the categorical array as marker_z -> array but it didn't work. I'd appreciate any help, as this is the last issue I need to solve before releasing a new version of the project.
cc: @daschw
Very practically, my question is the following:
How can I plot a scatter of points where colors are levels without depending on CategoricalArrays.jl?
using Plots
using CategoricalArrays
c = categorical([1,2,3])
scatter([1,2,3], [1,2,3], marker_z=c)
Appreciate any help regarding this issue. Maybe it should be moved to Plots.jl?
I think you need to wait for @nalimilan to have time to look at the issue, as he implemented the recipe AFAICT.
I have no idea, I just copied the definition provided by @daschw. Maybe @mkborregaard could help too?
Any help would be great. The issue is specific to the marker_z option. I can use categorical arrays in heatmaps, for example.
I guess I will have to revert the changes in downstream projects in order to release? Who is leading Plots.jl nowadays? Should the Julia community pay a software engineer to maintain and fix these issues? It is really hard to progress otherwise.
I will start a thread on Discourse to see what people think about starting a group to split the payment of a salary to a freelancer.
Unfortunately, recipes only apply to input data and not to attributes like marker_z. This would require major changes and additions in RecipesBase.jl and RecipesPipeline.jl, which I don't have the capacity to tackle right now. However, PRs are very welcome.
If I try to run your example I get the following error:
julia> scatter([1,2,3], [1,2,3], marker_z=c)
Error showing value of type Plots.Plot{Plots.GRBackend}:
ERROR: MethodError: no method matching get(::ColorSchemes.ColorScheme{Vector{RGBA{Float64}}, String, String}, ::CategoricalValue{Int64, UInt32}, ::Tuple{Float64, Float64})
Closest candidates are:
get(::CategoricalPool, ::Any, ::Any) at /home/dani/.julia/packages/CategoricalArrays/rDwMt/src/pool.jl:55
get(::DataStructures.RobinDict{K, V}, ::Any, ::Any) where {K, V} at /home/dani/.julia/packages/DataStructures/ixwFs/src/robin_dict.jl:384
get(::Test.GenericDict, ::Any, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1663
...
Stacktrace:
[1] get(::PlotUtils.ContinuousColorGradient, ::CategoricalValue{Int64, UInt32}, ::Tuple{Float64, Float64})
@ PlotUtils ~/.julia/packages/PlotUtils/es5pb/src/colorschemes.jl:18
So it seems like there is no method for get(colorscheme, categoricalvalue, range). Having a look at https://github.com/JuliaGraphics/ColorSchemes.jl/blob/a50d23f9ba76bc4811fbebd94dd309d6750abe6d/src/ColorSchemes.jl#L234 I see that they only implement methods for AllowedInput, which apparently is Union{Real, AbstractArray{<:Real}}. So it seems that CategoricalValue{Int} is not a subtype of Real.
I tried to overcome this by loosening the type restrictions for get in ColorSchemes locally. However, I then run into the following error:
julia> scatter([1,2,3], [1,2,3], marker_z=c)
Error showing value of type Plots.Plot{Plots.GRBackend}:
ERROR: ArgumentError: cannot compare a `CategoricalValue` to value `v` of type `CategoricalValue{Int64, UInt32}`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first
Stacktrace:
[1] <(x::CategoricalValue{Int64, UInt32}, y::Float64)
@ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:176
[2] <(y::Float64, x::CategoricalValue{Int64, UInt32})
@ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:180
[3] >(x::CategoricalValue{Int64, UInt32}, y::Float64)
@ Base ./operators.jl:305
[4] clamp(x::CategoricalValue{Int64, UInt32}, lo::Float64, hi::Float64)
@ Base.Math ./math.jl:65
[5] _broadcast_getindex_evalf
@ ./broadcast.jl:648 [inlined]
[6] _broadcast_getindex
@ ./broadcast.jl:621 [inlined]
[7] getindex
@ ./broadcast.jl:575 [inlined]
[8] copy
@ ./broadcast.jl:898 [inlined]
[9] materialize
@ ./broadcast.jl:883 [inlined]
[10] get(cscheme::ColorSchemes.ColorScheme{Vector{RGBA{Float64}}, String, String}, x::CategoricalValue{Int64, UInt32}, rangescale::Tuple{Float64, Float64})
@ ColorSchemes ~/.julia/dev/ColorSchemes/src/ColorSchemes.jl:240
I'm not really sure what we in Plots can do here without depending on CategoricalArrays.
Would you reconsider the dependency on CategoricalArrays.jl? Without an explicit treatment of categorical variables we won't be able to generate correct legend elements for nominal/ordered variables for example.
Without an explicit treatment of categorical variables we won't be able to generate correct legend elements for nominal/ordered variables for example.
Agreed. Also given that CategoricalArrays.jl is now compiler friendly and Plots.jl is compiler-heavy anyway I would vote to add this integration. An alternative would be to add appropriate methods to DataAPI.jl and use the interface.
The methods that error really look like they need a Real value, so it's not surprising that they fail for CategoricalValue and I don't see how they could work with them. Maybe what's needed is rather a fallback method for non-Real types that would attribute a real value to each unique value?
@juliohm How do you expect categorical values to be translated to colors? Should that be equivalent to passing the result of levelcode(v)?
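To make the suggested fallback concrete, here is a minimal sketch that assigns an integer code to each unique value using only Base Julia. The name value_codes is hypothetical and is not part of Plots.jl, ColorSchemes.jl, or CategoricalArrays.jl. Codes follow order of first appearance, so unlike levelcode they do not respect the level order of a categorical array.

```julia
# Hypothetical helper (an assumption for illustration, not an existing API):
# map each distinct value to an integer code in order of first appearance.
function value_codes(values::AbstractVector)
    lookup = Dict{eltype(values),Int}()
    codes = Vector{Int}(undef, length(values))
    for (i, v) in enumerate(values)
        # get! stores the next free code the first time a value is seen
        # and returns the stored code on every later occurrence.
        codes[i] = get!(lookup, v, length(lookup) + 1)
    end
    return codes
end

crops = ["wheat", "corn", "wheat", "soy"]
codes = value_codes(crops)
```

The resulting integer vector is a plain Vector{Int}, so it could in principle be passed wherever a numeric vector is accepted; whether the resulting colors and legend are what the user wants is a separate question.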
I was referring to the general approach. In this case (the mapping @juliohm wants) - I do not know enough about the problem to informatively comment.
@juliohm How do you expect categorical values to be translated to colors? Should that be equivalent to passing the result of levelcode(v)?
Or a method that maximizes the visual discrepancy between the categories. When the notion of order is important, it could produce a sequential colormap, for example. Think of a choropleth map like this one: https://en.wikipedia.org/wiki/Choropleth_map#/media/File:Countries_by_mean_wealth_per_adult_in_2018.png
The approach I mentioned would definitely work if you ignore the order and just choose a qualitative palette. To take into account order, the implementation would have to find a reliable way of checking whether the input values are ordered or not (see https://github.com/JuliaData/DataAPI.jl/pull/26).
I'm not sure whether it should be Plots or ColorSchemes' job to do this, but probably worth filing an issue in one of these packages?
The marker_z attribute in Plots is a mapping from numerical values to colors in a ColorGradient or ColorScheme. So this only makes sense for <: Real values. Right now I would not really know how to handle this generally for CategoricalValues if they are not numeric.
We could take the CategoricalArrays dependency in Plots, as @bkamins suggested, if it is compiler-friendly now, and fall back to levelcode. However, automatic matching with the legend entries in Plots is probably more involved than adding this fallback.
@juliohm why do you want to remove the CategoricalArrays dependency from your recipes?
Actually, I think markercolor would be the appropriate attribute here. This accepts vectors of any input that specifies colors or vectors of Int. In the latter case elements from the current ColorPalette are chosen via getindex. This also fails for CategoricalArrays right now, but I think it better translates to the idea behind CategoricalValues.
@juliohm why do you want to remove the CategoricalArrays dependency from your recipes?
I assumed that the plot recipes provided in CategoricalArrays.jl + Plots.jl would be the official way moving forward. So any package interested in plotting categorical variables would just assume it works out of the box and would forward a CategoricalArray to the Plots.jl pipeline.
If you use marker_z you pick colors from a "continuous" color scale. In that case you get a colorbar and a single legend entry, because you are plotting a single series:
scatter([1,2,3], [1,2,3], marker_z=[1, 2, 3])
If you use markercolor you pick colors from a "discrete/categorical" color scale. You still get a single legend entry because you plot a single series:
scatter([1,2,3], [1,2,3], markercolor=[1, 2, 3])
I suppose (I might be wrong here) you want different legend entries for different categorical values. For this you have to group your input into multiple series. This already works for CategoricalArrays with group:
scatter([1,2,3], [1,2,3], group=categorical(["A", "B", "C"]))
Anyway, I think this is not at all a CategoricalArrays issue and can be closed here. @juliohm if you want we can continue the discussion in a Plots issue.
From what I understand the problem now is about figuring out automatically which attribute to set depending on the vector type. If the vector of values is a vector of Number then we should use marker_z, otherwise we should use group for categorical arrays. Is that a correct statement, @daschw?
The problem remains because in order to differentiate between the two cases we need access to the categorical array type. So if Plots.jl could take CategoricalArrays.jl as a dependency, we could have a single attribute type for "color of markers" that would do the correct thing internally. Does it make sense?
I like that the issue is discussed here because then core maintainers of CategoricalArrays.jl can share their perspectives on a good design. Right now even with the group option (which I am gonna try soon), we need to be able to differentiate normal arrays from categorical arrays in user code.
From what I understand the problem now is about figuring out automatically which attribute to set depending on the vector type. If the vector of values is a vector of Number then we should use marker_z, otherwise we should use group for categorical arrays. Is that a correct statement, @daschw?
Actually, I'd argue we (Plots) should not automatically figure out which attribute to set, but use the attribute that is provided by the user. I think the combination of marker_z and CategoricalArray does not make sense, because marker_z only works for numerical values and, the way I see it, CategoricalValues are never really numerical, even if they have numerical values. This is also how the current recipe for CategoricalArrays works. Consider the following example, where two "numerical" CategoricalArrays are plotted. The axes have numerical labels (the values of the categorical arrays); however, they are not really numerical, as you can see if you consider the distances between the ticks.
plot(categorical([1, 3, 7]), categorical([19, -4, 100]))
Perhaps I didn't explain myself clearly. I am talking about automatic detection between non-categorical arrays and categorical arrays. End users will want to pass colors no matter if their variables are continuous or categorical. Currently they have to figure out by themselves that marker_z only works for continuous and group only works for categorical. There is no such thing as color = vector of colors or markercolor = vector of colors that works for both cases.
Package writers like myself could add a dependency on CategoricalArrays.jl to implement this basic choice for the user, but I think this would be much more useful in Plots.jl already (specifically plot recipes). Anyone wanting to plot categorical or continuous values could just pass a vector and internally Plots.jl would use the correct attribute.
Why not check whether the value is a Real, and if not consider that it's categorical? That would also make sense e.g. for strings. Then you don't need to depend on CategoricalArrays.
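As a rough sketch of that check, the following Base-only snippet picks an attribute name from the element type alone. The function name color_attribute is hypothetical, and the symbols it returns simply mirror the marker_z and group attributes discussed in this thread.

```julia
# Hypothetical helper (an assumption for illustration, not part of Plots.jl):
# treat Real-valued vectors as continuous and everything else as categorical.
color_attribute(values::AbstractVector) =
    eltype(values) <: Real ? :marker_z : :group

numeric_choice = color_attribute([0.1, 0.5, 0.9])   # numeric element type
string_choice  = color_attribute(["A", "B", "C"])   # non-numeric element type
```

One caveat of this design: the check looks only at eltype, so a Vector{Any} or a Vector{Union{Missing, Float64}} holding numbers is treated as categorical, which may or may not be the desired behavior.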
Unfortunately the group option doesn't work within plot recipes:
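using RecipesBase
using Plots
using CategoricalArrays

struct Foo end

@recipe function f(foo::Foo, data)
    seriestype --> :scatter
    if eltype(data) <: Number
        # continuous values: color by value and show a colorbar
        marker_z --> data
        colorbar --> true
    else
        # categorical values: this group attribute is the part that does not work
        group --> data
    end
    [Tuple(rand(2)) for i in 1:length(data)]
end

plot(Foo(), categorical([1,2,3]))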
I will go ahead and submit a release with this bug because of pressing deadlines, but it would be nice to see a workaround.
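One possible direction for a workaround, sketched here with Base Julia only, is to split the data into one index set per distinct value before plotting, so that each set can be drawn as its own series. The helper name indices_by_level is hypothetical, and this has not been tested inside a recipe.

```julia
# Hypothetical helper (an assumption for illustration, not an existing API):
# collect the indices belonging to each distinct value, preserving the
# order in which the values first appear.
function indices_by_level(data::AbstractVector)
    levels = unique(data)
    return [level => findall(isequal(level), data) for level in levels]
end

data = ["A", "B", "A", "C"]
for (level, idx) in indices_by_level(data)
    println(level, " => ", idx)
end
```

Inside a recipe, each pair could presumably be emitted as a separate series with the @series macro from RecipesBase, setting label to the level, which would give one legend entry per level without relying on group.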
@daschw do you have a solution for the plot recipe situation above?
That is no longer needed, and can be handled in downstream recipes with post-processing.