meryl
Homopolymer compression is not applied if the first read file is empty
Running meryl count compress with multiple read files does not apply homopolymer compression when the first file is empty. The following command creates an index without homopolymer compression:
meryl count compress k=21 threads=4 memory=32g empty.fa reads.fa output kmers_withempty
But putting the empty file after the reads file correctly creates a homopolymer-compressed index:
meryl count compress k=21 threads=4 memory=32g reads.fa empty.fa output kmers_withempty2
meryl print shows that the first index is not homopolymer compressed but the second is:
$ meryl print kmers_withempty/ | head
Found 1 command tree.
PROCESSING TREE #1 using 1 thread.
opLessThan
kmers_withempty/
print to (stdout)
AAAAAAAAAAAAAAAAATAAG 1
AAAAAAAAAAAAAAAACTACA 1
AAAAAAAAAAAAAAAATAAGG 1
AAAAAAAAAAAAAAACAATAC 1
AAAAAAAAAAAAAAACTACAG 1
AAAAAAAAAAAAAAATAAGGA 1
AAAAAAAAAAAAAACAATACT 1
AAAAAAAAAAAAAACTACAGA 1
AAAAAAAAAAAAAATAAGGAG 1
AAAAAAAAAAAAAAGTACTTT 1
$ meryl print kmers_withempty2 | head
Found 1 command tree.
PROCESSING TREE #1 using 1 thread.
opLessThan
kmers_withempty2/
print to (stdout)
ACACACACACACACACTACTA 1
ACACACACACACACTACTACT 1
ACACACACACACATCATATAC 1
ACACACACACACTACAGACAT 1
ACACACACACACTACAGATCA 1
ACACACACACACTACTACTAC 2
ACACACACACATCATATACAG 1
ACACACACACTACAGACATCA 1
ACACACACACTACAGATCATC 1
ACACACACACTACTACTACTA 4
$ meryl --version
meryl snapshot v1.4-development +29 changes (r969 97d5923dd69ebc3efed67fc466c21ed8c5e6670b)
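The difference between the two listings can also be checked mechanically. A homopolymer-compressed k-mer never contains two identical adjacent bases, so any k-mer with AA, CC, GG or TT shows that compression was not applied. The following is a minimal sketch of that check; the function name has_homopolymer_run is hypothetical and not part of meryl, and it assumes the k-mer is the first whitespace-separated field on each line, as in the meryl print output above.

```shell
# Hypothetical helper (not part of meryl): reads "KMER COUNT" lines on stdin
# and exits 0 if any k-mer contains two identical adjacent bases, which means
# the k-mers are NOT homopolymer compressed. Exits 1 otherwise.
has_homopolymer_run() {
  awk '
    # Only consider lines whose first field is a pure ACGT string,
    # so any log lines mixed into the stream are ignored.
    $1 ~ /^[ACGT]+$/ && $1 ~ /AA|CC|GG|TT/ { found = 1 }
    END { exit !found }
  '
}

# Example on one k-mer from each listing above:
if printf 'AAAAAAAAAAAAAAAAATAAG\t1\n' | has_homopolymer_run; then
  echo "kmers_withempty: homopolymer runs present"
fi
if ! printf 'ACACACACACACACACTACTA\t1\n' | has_homopolymer_run; then
  echo "kmers_withempty2: no homopolymer runs in sample"
fi
```

To check a real index, the output of meryl print can be piped into the function. A sample without runs does not prove compression by itself, but a single run is enough to show that it is missing.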
Thanks, Mikko. It's not just an empty first file that causes trouble. The 'compress' flag is reset after EACH file. The workaround is simple but annoying: add 'compress' before each input file.
I remember debating if this flag should be reset or not. I'm a little embarrassed I left it in.
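For reference, the workaround described above might look like the sketch below. The helper build_compress_count is hypothetical and only prints a command line, it does not run meryl. The placement of compress directly before each input file is one reading of the reply and has not been verified against this meryl version. The k, threads and memory values are copied from the original report, and the output name kmers_hpc is made up for illustration.

```shell
# Hypothetical helper: print a meryl count command with 'compress' repeated
# before every input file. It only builds the command string.
# Usage: build_compress_count OUTPUT INPUT...
# Note: file names containing whitespace are not handled in this sketch.
build_compress_count() {
  out=$1
  shift
  cmd="meryl count k=21 threads=4 memory=32g"
  for f in "$@"; do
    # Repeat the flag for each input, since it is reset after each file.
    cmd="$cmd compress $f"
  done
  printf '%s\n' "$cmd output $out"
}

build_compress_count kmers_hpc empty.fa reads.fa
# prints: meryl count k=21 threads=4 memory=32g compress empty.fa compress reads.fa output kmers_hpc
```

Printing the command rather than running it makes it easy to inspect before use. Once the flag is no longer reset per file, a single compress should be enough again.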