phyx pxcomp outputs an enormous file, or rejects unaligned sequences.

Hi all,

I compiled phyx (commit b359a6d, 2022/08/19) on Ubuntu 20.04. The tests from make check all pass.

I am trying to compute the Chi square test of compositional homogeneity with pxcomp from fasta input, but it fails in different manners:

nucleotides, aligned, ungapped:

 echo -e '>a\nACGTACAAGTCCATT\n>b\nACCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out

This succeeds, but the output file is 4.1G ! It contains some gigantic lines and we have to be careful not to crash the computer while trying to open it:

 sed -n 'l80' test_pxcomp.out | head -n 20

        Observed character counts:$
             A            C            G            T        Nchar$
a            5            4            2            4           15$
b            5            4            3            3           15$
Total                                                                          \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \
                                                                               \

So this file is filled with 4G of space characters I suppose.

nucleotides, aligned, with gap characters

 echo -e '>a\n-ACGTACAAGTCCATT\n>b\nA-CCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out

output file is also 4.1G.

nucleotides, unaligned, no gap
```
 echo -e '>a\nACGTA\n>b\nACCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out
```
Fails with Error: sequences must be aligned. Exiting.. However for this kind of test, there is no need to have aligned sequences. The average alignment frequencies can be scaled to the given sequence size, is this implemented this way?
amino-acids, aligned, no gap:
```
 echo -e '>a\nABCDEFGHIKLMNPQRSTVWXYZ\n>b\nAAADEFGHIKLMNPQRSTVWZZZ' | pxcomp -o test_pxcomp.out
```
Fails with Error: missing/ambiguous characters not currently supported. Exiting.

Any ideas for fixing these problems, especially the file size problem?

Feb 09 '23 13:02 Gullumluvl

Yikes! I will look at it. Thanks for the issue.

Feb 09 '23 13:02 josephwb

Regarding 3, that was a choice to go with aligned sequences. I'll think about relaxing this constraint. But an easy way around this at the moment is to just do a (poor) alignment first.

Feb 09 '23 13:02 josephwb

@Gullumluvl The bug is very stupid and regards formatting the output. It assumes that the taxon labels are a certain length; since yours are only 1 character, it doesn't know what to do and dies. I will fix this shortly. Sorry for the hassle.

Feb 09 '23 15:02 josephwb

@Gullumluvl should be fixed as of 0d0e7b9.

Please note this is a rough-and-dirty test of compositional heterogeneity. Better (tree-based) tests are available in p4 (and PAUP, I think).

Feb 09 '23 15:02 josephwb

Great, thank you! Ah yes maybe my dummy input was too dummy... The real dataset I tried first was failing because it's amino-acid, I guess? I wasn't aware of p4, thanks!

Feb 09 '23 20:02 Gullumluvl

@Gullumluvl Hmm not sure why it is not working on amino acids. It is working on binary data. I will take another look.

Feb 09 '23 21:02 josephwb

phyx phyx copied to clipboard

pxcomp outputs an enormous file, or rejects unaligned sequences.

phyx
phyx copied to clipboard