phangorn
Is the alignment size limited?
Is it possible to know the maximum alignment size that read.phyDat(), phyDat(), or as.phyDat() can parse?
Hi @thkuo,
I did not hard-code any explicit size restrictions into phangorn, as PhyML does. So I guess (I haven't really checked huge data sets) the limit is that of an R vector (or how I treat them), i.e. roughly 2 Gb or ~2 billion bases.
I have not made any changes yet (though it is now on my TODO list) along the lines of what Emmanuel did for ape (see http://ape-package.ird.fr/NEWS):
"DNAbin" objects larger than 2.1 billion bases (exactly 2^31-1) can now be handled by most functions in ape. The limitations are the same than in base R: 2^52 (~4.5*10^15) elements, and the numbers of rows and columns of matrices cannot exceed 2^31-1. read.dna() can now read FASTA files larger than 2.1 gigabytes. Two functions are still limited to 2.1 Gb: dist.dna and DNAbin2indel (the limit applies to the product of the numbers of sequences by the sequence length since they work with matrices).
Since it seems you handle large data (and R does not crash), or if other phangorn functions take a long time, it would be nice if you could profile the code (Rprof, summaryRprof) and send the results to me. I guess there are quite a few things which could be optimized (for speed and memory).
Rprof(tmp <- tempfile(), memory.profiling = TRUE)
# some phangorn code, e.g.
read.phyDat(large.file)
Rprof(NULL)  # stop profiling
summaryRprof(tmp, memory = "both")
unlink(tmp)
Dear Dr. Schliep,
Many thanks for the reply. My alignment includes roughly 500 sequences and 330k columns, which is well below 2 billion bases. Let me check some details more carefully and then send a profiling report if technical support is still needed.
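For reference, a quick back-of-the-envelope check of those dimensions in R against the single-vector limit mentioned above:
n_seq <- 500
n_col <- 330000
n_seq * n_col                          # ~1.65e8 bases in total
.Machine$integer.max                   # 2^31 - 1 = 2147483647
n_seq * n_col < .Machine$integer.max   # TRUE, well below the limit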
@thkuo, in this case it should be possible to read the data, or at least with only minor changes. read.phyDat internally uses read.dna to read the data and then transforms it with phyDat. Now phyDat stores only the unique site (column) patterns, and as you have only ~500 sequences the compression should be pretty good (there should be lots of columns which are all 'A', 'C', 'G' or 'T') -- or do you have SNP data? So check first whether read.dna works (it should); a sketch of that route is below.
If phyDat fails on the whole data set, we could just process chunks and combine them afterwards. Is it possible to send me the data set?
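A minimal sketch of that two-step route, with the file name large.fasta as a placeholder and as.phyDat() used for the conversion step:
library(ape)
library(phangorn)
# step 1: read the alignment with ape
dna <- read.dna("large.fasta", format = "fasta")
# step 2: convert to phyDat; only unique site (column) patterns are stored
aln <- as.phyDat(dna)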
@KlausVigo, many thanks for the support. Your suggestion reminded me to check the residue states in my data, and I found that my nucleotide alignment includes N's, which were not treated as an ambiguous state and possibly caused the loss of some columns. I updated the contrast matrix and it works well for now (although optimizing the MP tree seems to take a lot of time).
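For reference, a minimal sketch of supplying a custom contrast matrix to phyDat() so that 'n' (and gaps) are treated as fully ambiguous; the toy alignment below is hypothetical, not the actual data:
library(phangorn)
# hypothetical toy alignment containing N's
x <- matrix(c("a", "c", "g", "n",
              "a", "c", "n", "t"),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("seq1", "seq2"), NULL))
# contrast matrix: rows are the observed symbols, columns the four bases;
# 'n' and '-' match every base, i.e. are fully ambiguous
contrast <- rbind(diag(4), rep(1, 4), rep(1, 4))
dimnames(contrast) <- list(c("a", "c", "g", "t", "n", "-"), c("a", "c", "g", "t"))
aln <- phyDat(x, type = "USER", contrast = contrast)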