phyx icon indicating copy to clipboard operation
phyx copied to clipboard

pxcomp outputs an enormous file, or rejects unaligned sequences.

Open Gullumluvl opened this issue 2 years ago • 6 comments

Hi all,

I compiled phyx (commit b359a6d, 2022/08/19) on Ubuntu 20.04. The tests from make check all pass.

I am trying to compute the Chi square test of compositional homogeneity with pxcomp from fasta input, but it fails in different manners:

  1. nucleotides, aligned, ungapped:

     echo -e '>a\nACGTACAAGTCCATT\n>b\nACCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out
    

    This succeeds, but the output file is 4.1G ! It contains some gigantic lines and we have to be careful not to crash the computer while trying to open it:

     sed -n 'l80' test_pxcomp.out | head -n 20
    
            Observed character counts:$
                 A            C            G            T        Nchar$
    a            5            4            2            4           15$
    b            5            4            3            3           15$
    Total                                                                          \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
    

    So this file is filled with 4G of space characters I suppose.

  2. nucleotides, aligned, with gap characters

     echo -e '>a\n-ACGTACAAGTCCATT\n>b\nA-CCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out
    

    output file is also 4.1G.

  3. nucleotides, unaligned, no gap

     echo -e '>a\nACGTA\n>b\nACCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out
    

    Fails with Error: sequences must be aligned. Exiting.. However for this kind of test, there is no need to have aligned sequences. The average alignment frequencies can be scaled to the given sequence size, is this implemented this way?

  4. amino-acids, aligned, no gap:

     echo -e '>a\nABCDEFGHIKLMNPQRSTVWXYZ\n>b\nAAADEFGHIKLMNPQRSTVWZZZ' | pxcomp -o test_pxcomp.out
    

    Fails with Error: missing/ambiguous characters not currently supported. Exiting.

Any ideas for fixing these problems, especially the file size problem?

Gullumluvl avatar Feb 09 '23 13:02 Gullumluvl

Yikes! I will look at it. Thanks for the issue.

josephwb avatar Feb 09 '23 13:02 josephwb

Regarding 3, that was a choice to go with aligned sequences. I'll think about relaxing this constraint. But an easy way around this at the moment is to just do a (poor) alignment first.

josephwb avatar Feb 09 '23 13:02 josephwb

@Gullumluvl The bug is very stupid and regards formatting the output. It assumes that the taxon labels are a certain length; since yours are only 1 character, it doesn't know what to do and dies. I will fix this shortly. Sorry for the hassle.

josephwb avatar Feb 09 '23 15:02 josephwb

@Gullumluvl should be fixed as of 0d0e7b9.

Please note this is a rough-and-dirty test of compositional heterogeneity. Better (tree-based) tests are available in p4 (and PAUP, I think).

josephwb avatar Feb 09 '23 15:02 josephwb

Great, thank you! Ah yes maybe my dummy input was too dummy... The real dataset I tried first was failing because it's amino-acid, I guess? I wasn't aware of p4, thanks!

Gullumluvl avatar Feb 09 '23 20:02 Gullumluvl

@Gullumluvl Hmm not sure why it is not working on amino acids. It is working on binary data. I will take another look.

josephwb avatar Feb 09 '23 21:02 josephwb