phangorn icon indicating copy to clipboard operation
phangorn copied to clipboard

Phangorn distance functions crash R with segmentation fault when alignment contains missing values.

Open ejurga opened this issue 3 years ago • 2 comments

When a sequence in a DNAbin or phyDat alignment contains a missing value in place of a nucleotide, and the alignment is processed with a distance function from phangorn (e.g., dist.ml), R crashes from a sementation fault.

On linux (Ubuntu 18.04.5) R exits with the following error:

Process was terminated

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I understand that technically, an alignment probably shouldn't contain missing values, but the function vcfR::vcfR2DNAbin does output some NA's. Either way, these are valid values in a phyDat object and a DNAbin object, so probably phangorn's distance metrics should be able to handle them. At the very least, the behaviour is really annoying, and was pretty difficult to get to the root cause of.

Here is a MWE that is reproducable on my machine.

# Just make a sample sequence
make_seq <- function(size=size){
  nucl <- c('A', 'T', 'G', 'C')
  sample(nucl, replace = TRUE, size=size)
}

# Make a DNAbin object
fake_alignment <- function(size, n=10, add_na=FALSE){

  l = list()
  for (i in c(1:n)){
    seq <- c(make_seq(size=size))
    if (add_na==TRUE){ # Add an NA to the end of the alignment
      seq <- c(seq, NA)
    }
    l[[i]] <- seq
    }
  m <- do.call('rbind', l)
  ape::as.DNAbin(m)
}

# This works!
dna <- fake_alignment(n=2, size=1000, add_na=FALSE)
dist <- phangorn::dist.ml(dna)

# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
dna_with_na <- fake_alignment(n=2, size=1000, add_na=TRUE)
dist <- phangorn::dist.ml(dna_with_na

ejurga avatar Feb 17 '22 19:02 ejurga

Hi @ejurga, can you send me a short example with vcfR::vcfR2DNAbin code which produces this error? What would be the proper or expected handling of NA's? Removing or coding as "?" or "-" ? Thanks, Klaus

KlausVigo avatar Feb 18 '22 15:02 KlausVigo

Hi, This is a bug in ape: the conversion from "character" to "DNAbin" does not consider that some values could be NA. You can fix this bug 'by hand' since as.DNAbin.character() is not hidden. So, do the usual fix(as.DNAbin.character), then insert this line:

x[is.na(x)] <- "?"

right at the top of the body. Save and close. Another possibility is to do this manipulation before calling as.DNAbin(). Best, Emmanuel

emmanuelparadis avatar Apr 05 '22 13:04 emmanuelparadis