ggalluvial

improve spread, gather error message: Computation failed in `stat_*()`: Each row of output must be identified by a unique combination of keys.

Open guangingmai opened this issue 4 years ago • 10 comments

I want to use ggalluvial on a bar plot, but I get a warning message when I run the following code. Does anyone know how to fix it?

p <- ggplot(data = physeq_phylum, aes(x=sampleid, y=Abundance, alluvium = Phylum, stratum = Phylum))
p + geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) + 
  geom_stratum(aes(fill = Phylum), width = .6) 

Warning message:

## Warning message:
## Computation failed in `stat_alluvium()`:
## Each row of output must be identified by a unique combination of keys.
## Keys are shared for 8 rows:
## * 5, 6
## * 31, 32
## * 51, 52
## * 60, 61

guangingmai avatar Apr 16 '20 09:04 guangingmai

Hi @guangingmai, thanks for raising the issue. It's difficult to know exactly what the problem is without a reproducible example. Would you be able to share a subset of the data you're using that produces the same error? Check out the reprex package for how to generate an example.

The error message comes from tidyr::spread(). It is not the most informative, but it has been discussed in this issue thread. Probably you can resolve it by creating a new column of unique row IDs in the data set and passing this new column to alluvium. (Phylum would still be passed to stratum.)
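For illustration, the message can be reproduced outside of ggalluvial with a few lines. This is only a sketch on made-up data: the data frame `long` and its column names are hypothetical, and the call mimics the kind of reshaping described above rather than the package's actual internals.

```r
library(tidyr)

# Hypothetical long-format data: alluvium "a" appears twice at the axis "day1".
long <- data.frame(
  alluvium = c("a", "a", "b"),
  x        = c("day1", "day1", "day2"),
  stratum  = c("a", "a", "b"),
  stringsAsFactors = FALSE
)

# spread() needs each (alluvium, x) pair to occur at most once,
# so the repeated pair in rows 1 and 2 makes it fail.
msg <- tryCatch(
  spread(long, key = x, value = stratum),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

The row numbers in the printed message refer to the data frame handed to `spread()`, which inside a plotting call need not be the data frame you supplied.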

Please let me know if this doesn't help!

corybrunson avatar Apr 17 '20 16:04 corybrunson

First of all, thanks for your reply. I reshaped my dataset and found that it does not work when one group has two identical row IDs in one column, but it does work when the row IDs of each group in one column are unique. Why doesn't the former work?

guangingmai avatar Apr 18 '20 09:04 guangingmai

I'm glad you've found a solution, at least! I don't know what the columns contain, so i can't be sure why it works when another doesn't. If you can't share your entire data set, see if you can boil it down to a small data set that hits the same problem and that you can share.

corybrunson avatar Apr 18 '20 10:04 corybrunson

You can try my code below; the dataset is stored on the website.

data <- read.table('dataset.txt', header = TRUE)
p <- ggplot(data = data, aes(x=Sample, y=Abundance, alluvium = Phylum, stratum = Phylum))
(p1 <- p + geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) + 
  geom_stratum(aes(fill = Phylum), width = .6)) 

Warning output:

## Warning message:
## Computation failed in `stat_alluvium()`:
## Each row of output must be identified by a unique combination of keys.
## Keys are shared for 3 rows:
## * 4, 5, 6

guangingmai avatar Apr 18 '20 12:04 guangingmai

Could you say in more detail what sort of plot you're trying to produce? Most alluvial plots require three aesthetic specs: x (position along the horizontal axis), stratum (value in the stacked bar chart at each x value), and alluvium (identifier that links these position–value pairs for the same subject or observation). It looks like you've created two stacked bar plots, one for each sample—something that could be done with geom_bar(). What do you want the flows between them to represent?
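As a minimal sketch of those three aesthetics working together, here is a toy example. The data frame `toy` and its columns `subject`, `time`, and `response` are made up for illustration, and `geom_flow()` is used in place of `geom_alluvium()` so that each segment between adjacent axes can carry its own fill.

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical data: 4 subjects, each observed exactly once at 3 time points.
toy <- data.frame(
  subject  = rep(1:4, times = 3),
  time     = rep(c("t1", "t2", "t3"), each = 4),
  response = c("A", "A", "B", "B",
               "A", "B", "B", "B",
               "A", "B", "A", "B")
)

# x        = position along the horizontal axis
# stratum  = value stacked at each x
# alluvium = identifier linking the same subject across x values
p <- ggplot(toy, aes(x = time, stratum = response, alluvium = subject)) +
  geom_flow(aes(fill = response), alpha = .5, width = .6) +
  geom_stratum(aes(fill = response), width = .6)
p
```

Because every subject appears only once per time point, each combination of `time` and `subject` is unique, which is the condition the warning in this thread complains about.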

corybrunson avatar Apr 18 '20 18:04 corybrunson

I hope it's okay if I piggyback here. I am trying to do a similar thing over a timecourse. I have multiple days and (for the reprex) multiple US states reporting some value (pct), but not every state reports every day, so there aren't always alluvia going between consecutive days. I've discovered that something about the shape of the data determines whether this fails or not, but I can't determine what, since the error message about duplicated rows is either misleading, or referring to the data in an in-between stage that is not exposed to me.

The difference between the plots below is just the sampling to generate the fake data. The second plot is exactly the output desired.

library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 3.6.2
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.3
library(ggalluvial)

set.seed(123) # fails
fake_tmp <- data.frame(rowname = 1:20,
                       date = c("Day 1", "Day 2", "Day 3", "Day 4", "Day 5"),
                       pct = rnorm(20, mean = 5, sd = 2),
                       gene = sample(state.abb[1:20], 20, replace = TRUE))
tmp2 <- fake_tmp %>%
  gather(key, stratum, -rowname, -date, -pct)

ggplot(tmp2, aes(x = date, 
                 y = pct,
                 stratum = stratum,
                 alluvium = stratum)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum(aes(fill = stratum)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#> Warning: Computation failed in `stat_alluvium()`:
#> Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 7, 8


# Error refers to rows 7 & 8
tmp2[7:8,]
#>   rowname  date      pct  key stratum
#> 7       7 Day 2 5.921832 gene      GA
#> 8       8 Day 3 2.469878 gene      CT


set.seed(464) # succeeds
fake_tmp <- data.frame(rowname = 1:20,
                       date = c("Day 1", "Day 2", "Day 3", "Day 4", "Day 5"),
                       pct = rnorm(20, mean = 5, sd = 2),
                       gene = sample(state.abb[1:20], 20, replace = TRUE))
tmp2 <- fake_tmp %>%
  gather(key, stratum, -rowname, -date, -pct)

ggplot(tmp2, aes(x = date, 
                       y = pct,
                       stratum = stratum,
                       alluvium = stratum)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum(aes(fill = stratum)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Created on 2020-04-22 by the reprex package (v0.3.0)

mfoos avatar Apr 22 '20 15:04 mfoos

@mfoos absolutely fine. Thanks for bringing it up.

First, an apology: I have not yet learned how to produce the intelligent and informative warning and error messages of other packages, in particular ggplot2 and its tidyverse siblings. I should probably create an issue and invite help on that.

The error message that identifies rows 7 and 8 in your first example was spit out by tidyr::spread(), which is used internally by to_alluvia_form(), which is in turn used by StatAlluvium$compute_panel(). By the time it's used, though, the data set has been reordered, so the row numbers in the message don't correspond to those of the input data set. It turns out that they refer to two rows with the same values of date and stratum. That is, one state has been measured twice for the same axis. You can identify these directly with this line:

count(tmp2, date, stratum)

Please check back if this doesn't resolve the issue. I'll at least have the next version check for this sort of problem and throw an error earlier, since i still run into the same issue from time to time.
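To narrow the output of `count()` down to the offending rows, and to show one way of removing them, here is a small sketch. The data frame `toy` is made up to have the same columns as `tmp2`, and summing `pct` over repeated pairs is only an assumption about what the plot should show; averaging or dropping rows may suit your data better.

```r
library(dplyr)

# Hypothetical data shaped like tmp2: "GA" is reported twice on Day 2.
toy <- data.frame(
  date    = c("Day 1", "Day 2", "Day 2", "Day 3"),
  pct     = c(5.1, 5.9, 2.4, 4.0),
  stratum = c("GA", "GA", "GA", "CT"),
  stringsAsFactors = FALSE
)

# Keep only the (date, stratum) pairs that occur more than once.
dups <- toy %>%
  count(date, stratum) %>%
  filter(n > 1)
dups

# One possible fix: collapse each repeated pair into a single row.
toy_unique <- toy %>%
  group_by(date, stratum) %>%
  summarise(pct = sum(pct)) %>%
  ungroup()
toy_unique
```

After the aggregation, every combination of `date` and `stratum` occurs once, so using `stratum` as the `alluvium` aesthetic no longer produces shared keys.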

corybrunson avatar Apr 22 '20 16:04 corybrunson

awesome awesome awesome, this is super helpful, thank you!

mfoos avatar Apr 22 '20 16:04 mfoos

@corybrunson The spread() function has the "retired" lifecycle status.

Quote: "Development on spread() is complete, and for new code we recommend switching to pivot_wider()"

Andreas-Bio avatar May 03 '20 17:05 Andreas-Bio

@andzandz11 thanks for mentioning this. A future major release, probably the one after next, will indeed replace gather() and spread() with pivot_longer() and pivot_wider(). The switch is underway, and the release will include some new features that the switch enables; check out the pivot and pivot-params branches if you're interested. Meanwhile, the retired functions will remain exported in tidyr, so i haven't bumped the switch up to the next release (the devel and devel-parsimony branches).
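For readers unfamiliar with the newer functions, here is a rough sketch of how a `spread()` call maps onto `pivot_wider()`. The data frame `long` and its column names are hypothetical, and this is not the code on the branches mentioned above.

```r
library(tidyr)  # pivot_wider() requires tidyr >= 1.0.0

# Hypothetical long-format data with one row per (alluvium, x) pair.
long <- data.frame(
  alluvium = c("a", "a", "b", "b"),
  x        = c("day1", "day2", "day1", "day2"),
  stratum  = c("a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Retired interface: key/value column pair.
wide_old <- spread(long, key = x, value = stratum)

# Newer interface: names_from/values_from play the same roles.
wide_new <- pivot_wider(long, names_from = x, values_from = stratum)

wide_old
wide_new
```

Both calls give one row per alluvium and one column per axis. As far as I know, `pivot_wider()` handles repeated keys differently from `spread()`: it warns and returns list-columns instead of stopping with an error, so an explicit check for duplicates beforehand is still worthwhile.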

corybrunson avatar May 03 '20 19:05 corybrunson