dendextend

cutree sort_cluster_numbers

Open jefferis opened this issue 9 years ago • 7 comments

Hi Tal, first of all, greetings for 2016!

Now a quick question, which I think may imply a bug in the docs or in the cutree function. Something I often want to do is return the cluster ids for all individuals. I often need to return:

  • the individuals in their original (data) order (like stats::cutree)
  • but the cluster ids in dendrogram order

library(dendextend)  # provides the cutree methods with the extra arguments used below
hc <- hclust(dist(USArrests), "ave")
cutree(hc, k=4)
plot(hc)

What I want can be achieved like this:

cutree(hc, k=4, order_clusters_as_data = F)[hc$labels]
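
For comparison, the same target (individuals in data order, cluster ids numbered from left to right across the dendrogram) can be sketched with base R only. This is an illustration and not dendextend's implementation; the names `left_to_right` and `ids_dend` are made up for the example.

```r
hc <- hclust(dist(USArrests), "ave")

# stats::cutree returns a named vector in the original (data) order
ids <- stats::cutree(hc, k = 4)

# hc$order lists the observations from left to right in the dendrogram, so this
# gives the stats::cutree ids in the order the clusters first appear in the plot
left_to_right <- unique(ids[hc$order])

# Renumber: the left-most cluster becomes 1, the next one 2, and so on
ids_dend <- setNames(match(ids, left_to_right), names(ids))

head(ids_dend)
```

The grouping itself is unchanged; only the labels attached to each cluster differ from what stats::cutree returns.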

But I can't seem to achieve it with any permutation of cutree arguments

# individuals in data order, cluster ids like stats::cutree
cutree(hc, k=4, order_clusters_as_data = T, sort_cluster_numbers = T)

# individuals in data order, cluster ids still identical to stats::cutree
cutree(hc, k=4, order_clusters_as_data = T, sort_cluster_numbers = F)

# individuals in dendrogram order, cluster ids in dendrogram order
cutree(hc, k=4, order_clusters_as_data = F, sort_cluster_numbers = T)

# individuals in dendrogram order, cluster ids like stats::cutree
cutree(hc, k=4, order_clusters_as_data = F, sort_cluster_numbers = F)

  1. Does this seem like a sensible thing to want to do?
  2. Can it be achieved directly?
  3. Looking at the help for sort_cluster_numbers:

sort_cluster_numbers logical (TRUE). Should the resulting cluster id numbers be sorted? (default is TRUE in order to make the function compatible with cutree) from stats, but it allows for sensible color order when using color_branches.

But I don't really understand that since

  1. sort_cluster_numbers has no effect when order_clusters_as_data = T
  2. one must set sort_cluster_numbers = F when order_clusters_as_data = F to give the same cluster ids as stats::cutree (the opposite of what I would expect from the help).

Thanks for any insight! Greg Jefferis.

jefferis, Jan 02 '16 13:01

Dear Greg, First - happy new year to you too :)

I'm glad to see that you are taking an interest in the code.

I would indeed consider this a bug in the documentation. Here are the options for taking this forward:

  1. To remove the sort_cluster_numbers option altogether (and always keep it as TRUE)
  2. To update the doc to explain this behavior (mmm, probably not)
  3. To have this option work the way you want it to (but then comes the question of whether this is helpful to users; if your use case is rare, maybe it should be moved to an external function)

What do you think?

As to your questions:

  1. I am not sure. Why is that important for you? What is the use case? My motivation for these options was that I relied on code for cutree.dendrogram that wouldn't give the same number ids as stats::cutree, so I needed to do more manipulation, which ended up as these parameters.
  2. It can't be (at this point). But let's discuss here if it should be implemented in the function.
  3. You are correct. The documentation is wrong here, and the real issue is a bit more complex. Background: since stats::cutree relies on C code, which is faster than the R code, there is a parameter in cutree.dendrogram called try_cutree_hclust. It should be set to FALSE when testing the function (otherwise the function just turns the dendrogram into an hclust object, if it can, and runs the usual cutree on it). This has "weird" implications. Let's take a small example:

d1 <- USArrests[1:5,]
hc <- hclust(dist(d1), "ave")
dend <- as.dendrogram(hc)
plot(hc)

In the following examples, the three calls in each group give the same outcome:

## sort_cluster_numbers = T
# order_clusters_as_data = T
cutree(hc, k=4, order_clusters_as_data = T, sort_cluster_numbers = T)
cutree(dend, k=4, order_clusters_as_data = T, sort_cluster_numbers = T)
cutree(dend, k=4, order_clusters_as_data = T, sort_cluster_numbers = T, try_cutree_hclust = FALSE)
# order_clusters_as_data = F
cutree(hc, k=4, order_clusters_as_data = F, sort_cluster_numbers = T)
cutree(dend, k=4, order_clusters_as_data = F, sort_cluster_numbers = T)
cutree(dend, k=4, order_clusters_as_data = F, sort_cluster_numbers = T, try_cutree_hclust = FALSE)

However, in the following example, the case in which try_cutree_hclust = FALSE will give a different outcome!

## sort_cluster_numbers = F
# order_clusters_as_data = T
cutree(hc, k=4, order_clusters_as_data = T, sort_cluster_numbers = F)
cutree(dend, k=4, order_clusters_as_data = T, sort_cluster_numbers = F)
cutree(dend, k=4, order_clusters_as_data = T, sort_cluster_numbers = F, try_cutree_hclust = FALSE)
# order_clusters_as_data = F
cutree(hc, k=4, order_clusters_as_data = F, sort_cluster_numbers = F)
cutree(dend, k=4, order_clusters_as_data = F, sort_cluster_numbers = F)
cutree(dend, k=4, order_clusters_as_data = F, sort_cluster_numbers = F, try_cutree_hclust = FALSE)

I am open to your suggestions.

Best, Tal

talgalili, Jan 02 '16 16:01

Hi Tal,

I think that removing the option is not what I would have done, because I think there are good reasons to want the behaviour that was implied by having this. Quoting from the (still extant) sort_levels_values documentation:

#' This function is useful for \link[dendextend]{cutree} - making the 
#' sort_cluster_numbers parameter possible. Using that parameter with TRUE
#' makes the clusters id's from cutree to be ordered from left to right. 
#' e.g: the left most cluster in the tree will be numbered "1", the one
#' after it will be "2" etc...).

I made an example the other day, but I was debugging a graphics driver and managed to crash my machine. Essentially, if one wants to use the cluster ids in conjunction with other plots of the data that will be compared with the dendrogram, then one needs exactly what I was saying:

  • the individuals in their original (data) order (like stats::cutree)
  • but the cluster ids in dendrogram order
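
As a rough illustration of that use case, the base-R sketch below builds a colour vector that is in data order but whose colours follow the clusters from left to right in the dendrogram. The palette and the names `cluster_cols` and `point_cols` are arbitrary choices for the example, and the scatter plot variables are just two columns of USArrests.

```r
hc <- hclust(dist(USArrests), "ave")
ids <- stats::cutree(hc, k = 4)

# Renumber the clusters by their left-to-right position in the dendrogram
ids_dend <- match(ids, unique(ids[hc$order]))

# One colour per cluster: the first colour goes to the left-most cluster
cluster_cols <- c("red", "darkgreen", "blue", "orange")
point_cols <- setNames(cluster_cols[ids_dend], names(ids))

# point_cols is in data order, so it lines up with the rows of USArrests
if (interactive()) {
  plot(USArrests$UrbanPop, USArrests$Murder, col = point_cols, pch = 19)
}
```

A dendrogram whose branches are coloured left to right with the same palette would then match the scatter plot cluster for cluster.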

I think it would be desirable to enable this functionality (it looks very much like it was intended).

Incidentally, while helper functions can be a good thing, my experience is that R users often have trouble finding functions to do what they need. Therefore I think this functionality does belong in cutree.

Best,

Greg.

jefferis, Jan 05 '16 17:01

Hi Greg, OK, I will reintroduce this parameter later this week.

Yours, Tal

talgalili, Jan 05 '16 20:01

Hi Tal,

I was thinking a bit more about this. I'm not sure that the present argument names are very clear. I was trying to think what might be better but I don't know how much change you'd consider.

In any case if you bring them back as is, I would suggest the following default value:

sort_cluster_numbers=!order_clusters_as_data
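
Because R evaluates default arguments lazily, a default can refer to another argument in this way. The sketch below uses a hypothetical base-R wrapper, `cutree_sketch`, which is not dendextend's cutree; it assumes that sort_cluster_numbers = TRUE means "number the clusters from left to right", as in the sort_levels_values documentation quoted earlier in this thread.

```r
# Hypothetical wrapper for illustration only (not part of dendextend)
cutree_sketch <- function(hc, k,
                          order_clusters_as_data = TRUE,
                          sort_cluster_numbers = !order_clusters_as_data) {
  ids <- stats::cutree(hc, k = k)
  if (sort_cluster_numbers) {
    # Renumber clusters by their left-to-right position in the dendrogram
    ids <- setNames(match(ids, unique(ids[hc$order])), names(ids))
  }
  if (!order_clusters_as_data) {
    # Return the individuals in dendrogram (leaf) order
    ids <- ids[hc$order]
  }
  ids
}

hc <- hclust(dist(USArrests), "ave")
by_data <- cutree_sketch(hc, k = 4)                                  # same as stats::cutree
by_dend <- cutree_sketch(hc, k = 4, order_clusters_as_data = FALSE)  # leaf order, ids 1, 2, ...
wanted  <- cutree_sketch(hc, k = 4, sort_cluster_numbers = TRUE)     # data order, left-to-right ids
```

With this default, a call that leaves both arguments untouched still matches stats::cutree, while the combination asked for at the top of this issue is reached by setting sort_cluster_numbers = TRUE explicitly.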

Best,

Greg.

jefferis, Jan 05 '16 22:01

Hi Greg, I added you as an author of the package (in the DESCRIPTION), and gave you push privileges to the github repo.

Please feel free to add if(sort_cluster_numbers) ... "cutree(hc, k=4, order_clusters_as_data = F)[hc$labels]" at some point near the end of the cutree.dendrogram and cutree.hclust functions (or use any other parameter name you think makes sense). Please just make sure that the defaults of the functions give the same outcome as stats::cutree.

Since there would probably be more places to further push this package, I would be happy for your help.

With regards, Tal

talgalili, Jan 06 '16 07:01

Hi Tal,

Thank you very much for this invitation. When I have something I'll probably still do a PR. I agree that it is essential that the function's default behaviour matches stats::cutree. All the best,

Greg.

jefferis, Jan 07 '16 01:01

Great, Greg. Looking forward to updates from you, once you get around to it.

With regards, Tal

talgalili, Jan 07 '16 13:01