
Error when running mmseqs createsubdb: sh: 1: Syntax error: ")" unexpected

Open marcasriv opened this issue 3 years ago • 7 comments

Hi,

I'm interested in running PhyloCSF++ with annotate-with-mmseqs on Chinese hamster, but I am getting an error when it reaches the mmseqs createsubdb step:

./phylocsf++ annotate-with-mmseqs --threads 35 --output conservation species.txt 58mammals criGri1.refGene.gtf

Checking whether MMseqs2 is installed ...
Processing GFF /mnt/HDD2/conservation/criGri1.refGene.gtf
Created the genomesDB directory.
Created the cds directory.
Reading reference genome of GFF file /mnt/HDD2/conservation/fastas/criGri1.fa ...
Reading GFF file and extracting CDS coordinates ...
MMseqs2: Indexing genomes ...
MMseqs Version: 42bf6438fec1e1b987f46d8f6d4b09926ecfc019
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Compressed 0
Verbosity 3

Converting sequences
[410465] 1m 2s 307ms
Time for merging to genbankseqs_h: 0h 0m 0s 74ms
Time for merging to genbankseqs: 0h 0m 43s 532ms
Database type: Nucleotide
Time for processing: 0h 1m 46s 799ms
bash -c $'mmseqs createsubdb <(awk '$3 == 0' /mnt/HDD2/conservation//genomesDB/genbankseqs.lookup) conservation//genomesDB/genbankseqs /mnt/HDD2/conservation//genomesDB/genbankseqs_0'
sh: 1: Syntax error: ")" unexpected

This is what the input species.txt file looks like:

chinese_hamster conservation/fastas/criGri1.fa
mouse conservation/fastas/Mus_musculus.GRCm39.dna.primary_assembly.fa
rat conservation/fastas/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa
human conservation/fastas/Homo_sapiens.GRCh38.dna.primary_assembly.fa
naked_mole_rat conservation/fastas/Heterocephalus_glaber_female.HetGla_female_1.0.dna.toplevel.fa
guinea_pig conservation/fastas/Cavia_porcellus.Cavpor3.0.dna.toplevel.fa
squirrel conservation/fastas/Ictidomys_tridecemlineatus.SpeTri2.0.dna.toplevel.fa
rabbit conservation/fastas/Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa
pika conservation/fastas/Ochotona_princeps.OchPri2.0-Ens.dna.toplevel.fa

And I have downloaded the reference GTF file and fasta files from https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/genes/criGri1.refGene.gtf.gz and https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/criGri1.fa.gz

Thanks so much,

Marina

marcasriv, Oct 14 '21 12:10
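A note on the error itself: the `sh: 1:` prefix suggests the command line was interpreted by /bin/sh (which is dash on many Debian and Ubuntu systems) rather than by bash, and neither `<(...)` process substitution nor `$'...'` quoting is guaranteed to work there. The sketch below shows one POSIX-compatible way to get the same selection by writing the awk output to a regular file first. It is only an illustration, not necessarily how the fix in master works; the temporary directory, the demo lookup content, and the assumed three-column lookup layout are hypothetical stand-ins.

```shell
#!/bin/sh
# Sketch: select lookup entries whose third column is 0 without bash-only
# process substitution. All paths and the demo data are hypothetical.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Tiny stand-in for a .lookup file (assumed columns: id, name, file number).
lookup="$workdir/genbankseqs.lookup"
printf '0\tchr1\t0\n1\tchr2\t0\n2\tscaffold1\t1\n' > "$lookup"

# Write the filtered rows to a regular file instead of using <(...).
subset="$workdir/subset.lookup"
awk '$3 == 0' "$lookup" > "$subset"

subset_rows=$(wc -l < "$subset" | tr -d ' ')
echo "rows selected: $subset_rows"

# With real data, the file could then be passed where <(...) was used, e.g.:
# mmseqs createsubdb "$subset" genomesDB/genbankseqs genomesDB/genbankseqs_0
```

Using a regular file costs one extra write, but the command then behaves the same no matter which shell ends up interpreting it.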

Hi Marina,

thank you for trying out PhyloCSF++ and opening an issue! I made a fix and pushed it to the master branch. Can you try running it again with the latest commit? Let me know if you need help building PhyloCSF++ from source; I can also upload a statically linked binary here.

If the fix works for you, we will make a new release, update it on bioconda and distribute new binaries.

Christopher

cpockrandt, Oct 22 '21 21:10

Hi Christopher,

Thanks so much for your help and the fix! I rebuilt PhyloCSF++ with the latest commit and it now runs smoothly past that error. Unfortunately, I've run into a new problem. The program now crashes at (I believe) line 422 of phylocsf++annotate_with_mmseqs.hpp (same parameters and files as in my previous post):

mmseqs result2dnamsa conservation//cds/cds.index conservation//genomesDB/genbankseqs /conservation//aln/aln_all_tophit conservation//aln/msa --threads _40

MMseqs Version: 42bf6438fec1e1b987f46d8f6d4b09926ecfc019
Skip query false
Threads 40
Compressed 0
Verbosity 3
Query database size: 99405 type: Nucleotide
Target database size: 410501 type: Nucleotide
[=================================================================] 100.00% 99.40K 7m 13s 889ms
Time for merging to msa: 0h 0m 0s 216ms
Time for processing: 0h 7m 15s 116ms
MMseqs2: Score aligned CDS ...

terminate called after throwing an instance of 'std::length_error'
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
what(): terminate called recursively
terminate called recursively
terminate called recursively
Aborted (core dumped)

Thanks again,

Marina

marcasriv, Oct 25 '21 12:10

Can you give me the list of assemblies you used, so that we can try to reproduce this error?

cpockrandt, Oct 25 '21 14:10

Hi Christopher,

Sorry for the late reply. This is the list of FASTA files I used:

https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/criGri1.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/rn6/bigZips/rn6.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/hetGla2/bigZips/hetGla2.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/cavPor3/bigZips/cavPor3.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/speTri2/bigZips/speTri2.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/oryCun2/bigZips/oryCun2.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/ochPri3/bigZips/ochPri3.fa.gz

and reference GTF:

https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/genes/criGri1.refGene.gtf.gz

Thanks,

Marina

marcasriv, Oct 26 '21 07:10

Hi Marina,

thank you, we were able to reproduce the error and added a fix to the master branch. Before you run it again, please make sure to delete any temporary files in the output directory from the previous runs.

Christopher

cpockrandt, Nov 15 '21 02:11
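For the clean-up step, a small helper can show what an earlier run left behind before anything is deleted. This is a cautious sketch: the subdirectory names genomesDB, cds and aln are taken from the paths in the logs above, and whether they cover everything the tool writes is an assumption. The function name `list_leftovers` and the default directory `conservation` are only illustrative.

```shell
#!/bin/sh
# Sketch: list intermediate directories from an earlier run.
# genomesDB, cds and aln appear in the logs above; treating them as the
# complete set of temporary output is an assumption.
list_leftovers() {
    outdir=$1
    for sub in genomesDB cds aln; do
        if [ -d "$outdir/$sub" ]; then
            echo "$outdir/$sub"
        fi
    done
}

# Inspect first; delete by hand once the list looks right.
list_leftovers "${1:-conservation}"
```

The helper only prints paths and never removes anything, so it is safe to run against a directory that also holds input files such as the FASTA folder.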

Hi Christopher,

Thanks so much for your reply. I've removed the previous installation of PhyloCSF++, cloned the latest version, re-installed it, and removed all files from previous runs, but I'm still getting the same error at the same line of code. I've also tried changing the location of the output directory, but no luck so far. Could something on my system be overriding the new install?

Marina

marcasriv, Nov 17 '21 14:11

Hi Marina,

I tried it on another system and it works for me with the latest commit and the data set that you listed above. You don't have to "install" PhyloCSF++ on your system; after running make, you can call the binary directly in the build directory with ./phylocsf++ to make sure that you are really using the latest build and not an outdated binary that might still be in the PATH.

cpockrandt, Nov 28 '21 05:11
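To check whether an older binary on the PATH is shadowing the fresh build, something like the following can be run from the build directory. It is a minimal sketch: `resolve_binary` is a hypothetical helper name, and the assumption is that the new binary sits in the current directory as ./phylocsf++.

```shell
#!/bin/sh
# Print the path a bare command name resolves to, or nothing if not found.
resolve_binary() {
    command -v "$1" 2>/dev/null || true
}

on_path=$(resolve_binary phylocsf++)
if [ -n "$on_path" ]; then
    echo "bare 'phylocsf++' resolves to: $on_path"
else
    echo "phylocsf++ is not on PATH"
fi

# A freshly built binary in the current directory, if present (location assumed).
if [ -x ./phylocsf++ ]; then
    echo "local build: $(pwd)/phylocsf++"
fi
```

If the first line prints a path outside the build directory, calling the tool by its bare name would run that copy, while ./phylocsf++ always runs the local build.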