
Is the alignment size limited?

thkuo opened this issue on Oct 08 '17 · 4 comments

Is it possible to know the maximum size of alignment that read.phyDat(), phyDat() or as.phyDat() can parse?

thkuo avatar Oct 08 '17 20:10 thkuo

Hi @thkuo,
Unlike PhyML, I did not hard-code any explicit size restrictions into phangorn. So I guess (I haven't really checked huge data sets) the limit is that of an R vector (or of how I handle them), which should be roughly 2 Gb, i.e. ~2 billion bases. I have not yet made the changes Emmanuel made for ape (it is now on my TODO list); see http://ape-package.ird.fr/NEWS:

"DNAbin" objects larger than 2.1 billion bases (exactly 2^31-1) can now be handled by most functions in ape. The limitations are the same than in base R: 2^52 (~4.5*10^15) elements, and the numbers of rows and columns of matrices cannot exceed 2^31-1. read.dna() can now read FASTA files larger than 2.1 gigabytes. Two functions are still limited to 2.1 Gb: dist.dna and DNAbin2indel (the limit applies to the product of the numbers of sequences by the sequence length since they work with matrices).

If you are handling large data sets (and R does not crash), or other phangorn functions take a long time, it would be nice if you could profile the code (Rprof, summaryRprof) and send the results to me. I guess there are quite a few things that could be optimized (for speed and memory).

Rprof(tmp <- tempfile(), memory.profiling = TRUE)  # start profiling into a temporary file
# some phangorn code, e.g.
read.phyDat(large.file)
Rprof(NULL)                                        # stop profiling
summaryRprof(tmp, memory = "both")                 # summarise time and memory use
unlink(tmp)                                        # remove the temporary profile file

KlausVigo avatar Oct 09 '17 20:10 KlausVigo

Dear Dr. Schliep,

Many thanks for the reply. My alignment includes roughly 500 sequences and 330k columns, which is well below 2 billion bases. Let me check some details more carefully, and I will send the report if technical support is still needed.

thkuo avatar Oct 09 '17 20:10 thkuo

@thkuo, in this case it should be possible to read the data, or at least with only minor changes. read.phyDat internally uses read.dna to read the data and transforms it with phyDat afterwards. Now phyDat stores only the unique site (column) patterns. As you have only ~500 sequences, the compression should be pretty good (there should be lots of columns which are all 'A', 'C', 'G' or 'T'), or do you have SNP data? So check first whether read.dna works (it should). If phyDat fails on the whole data set, we could just process chunks and combine them afterwards. Is it possible to send me the data set?
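
A minimal sketch of that two-step route, assuming a hypothetical FASTA file alignment.fasta (the file name, chunk size, and use of cbind on phyDat objects are illustrative assumptions):

library(ape)
library(phangorn)

aln <- read.dna("alignment.fasta", format = "fasta")   # step 1: read into a DNAbin matrix
dat <- phyDat(aln, type = "DNA")                       # step 2: compress to unique site patterns
dat                                                    # shows the number of sequences and site patterns

# If phyDat() failed on the full matrix, the columns could be converted in
# chunks and combined afterwards (assuming cbind dispatches to a phyDat method):
idx    <- split(seq_len(ncol(aln)), ceiling(seq_len(ncol(aln)) / 50000))
chunks <- lapply(idx, function(i) phyDat(aln[, i], type = "DNA"))
dat    <- do.call(cbind, chunks)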

KlausVigo avatar Oct 09 '17 21:10 KlausVigo

@KlausVigo, many thanks for the support. Your suggestion reminded me to check the residue states in my data, and I found that my nucleotide alignment includes N's, which were not treated as an ambiguous state and possibly caused the loss of some columns. I have updated the contrast matrix and it works well for now (although optimizing the MP tree seems to take a lot of time).
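
A hedged sketch of what such a contrast matrix might look like, with "n" (and gaps) mapped to all four bases; the exact matrix used here is not shown in the thread, and my_alignment is a hypothetical character matrix of sequences:

library(phangorn)

bases    <- c("a", "c", "g", "t")
contrast <- rbind(diag(4),          # a, c, g, t each match only themselves
                  rep(1, 4),        # "n" matches any base
                  rep(1, 4))        # "-" treated as fully ambiguous
dimnames(contrast) <- list(c(bases, "n", "-"), bases)

dat <- phyDat(my_alignment, type = "USER", contrast = contrast)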

thkuo avatar Oct 10 '17 09:10 thkuo