
species not included in target list of species

Open adamfreedman opened this issue 3 years ago • 2 comments

I am running into an issue: after building a HAL file for 4 species with cactus and splitting it into MAFs with hal2maf_split.pl, running augustus in comparative mode produces the following warning:

"Warning: Species Bmor Dple are not included in the target list of species. These alignment lines are ingored."

NOTE: all of this was done within a singularity container (which worked fine with the toy vertebrates data)

Conversion to MAF was done as follows:

CHUNKSIZE=2500000
OVERLAP=500000
/root/augustus/scripts/hal2maf_split.pl --halfile 4butterflies_cactus_brchlen1.hal --refGenome HmelRef --cpus 8 --chunksize $CHUNKSIZE --overlap $OVERLAP --no_split_list no_split_list.txt --outdir mafs

where the no-split list was derived from the H. melpomene annotation, per the recommendations on the Augustus web page at Greifswald.

I saw in a previous post that dashes in genome names caused an issue; that is not the case with my data. The input tree (for both cactus and augustus) is:

(((HmelRef:1,HeraRef:1),Dple:1),Bmor:1);

We set all branch lengths to 1 since we had no estimates for them.

The output MAFs contain a version of this tree with ancestors added:

hal (((HmelRef:1,HeraRef:1)Anc2:1,Dple:1)Anc1:1,Bmor:1)Anc0;

halStats on the HAL file reproduces this:

hal v2.1
(((HmelRef:1,HeraRef:1)Anc2:1,Dple:1)Anc1:1,Bmor:1)Anc0;

GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
Anc0, 2, 59966479, 9517, 0, 2797322
Anc1, 2, 92050684, 10560, 3125681, 6056811
Anc2, 2, 210794426, 6213, 6986658, 13317174
HmelRef, 0, 275198219, 794, 15625981, 0
HeraRef, 0, 382828983, 195, 17856539, 0
Dple, 0, 246770625, 1708, 8027394, 0
Bmor, 0, 456867856, 4783, 5149758, 0
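A quick way to confirm that the input tree and the tree stored in the HAL file carry the same leaf names, and that none of them contains a dash, is sketched below. The helper `leaf_names` is hypothetical and only handles simple Newick strings without quoted labels, which is enough for the two trees shown here.

```python
import re

def leaf_names(newick):
    """Return the leaf labels of a simple Newick string, in order.

    A leaf label directly follows '(' or ','. Labels of internal
    nodes (Anc0, Anc1, ...) follow ')' and are therefore skipped.
    Quoted labels are not supported in this sketch.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

# Both trees are copied from the post above.
input_tree = "(((HmelRef:1,HeraRef:1),Dple:1),Bmor:1);"
hal_tree = "(((HmelRef:1,HeraRef:1)Anc2:1,Dple:1)Anc1:1,Bmor:1)Anc0;"

same_leaves = sorted(leaf_names(input_tree)) == sorted(leaf_names(hal_tree))
names_with_dash = [name for name in leaf_names(hal_tree) if "-" in name]

print(leaf_names(input_tree))
print("same leaf set:", same_leaves)
print("names containing a dash:", names_with_dash)
```

For these two trees the leaf sets agree and no name contains a dash, which matches what the post reports.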

The augustus command is structured as follows:

augustus --species=heliconius_melpomene1 --softmasking=0 --treefile=../tree.nwk --alnfile=../maflinks/${SLURM_ARRAY_TASK_ID}.maf --dbaccess=../butterflies_4spec.db --speciesfilenames=../genomes.tbl --alternatives-from-evidence=0 --/CompPred/outdir=pred${SLURM_ARRAY_TASK_ID}

where genomes.tbl is as follows:

Bmor /n/holyscratch01/informatics/genome_annotation_evaluation/heliconines/comparative_augustus/Bmor.fa
Dple /n/holyscratch01/informatics/genome_annotation_evaluation/heliconines/comparative_augustus/Dple.fa
HeraRef /n/holyscratch01/informatics/genome_annotation_evaluation/heliconines/comparative_augustus/HeraRef.fa
HmelRef /n/holyscratch01/informatics/genome_annotation_evaluation/heliconines/comparative_augustus/HmelRef.fa
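One diagnostic that may help here is to list the species prefixes that actually occur in a MAF chunk and compare them with the names in genomes.tbl. The sketch below assumes the usual MAF convention that the source field of each `s` line has the form `species.seqname`, and it assumes the warning is triggered by a mismatch between those prefixes and the species names Augustus has loaded; neither point is confirmed in this thread. The MAF block, sequence names and paths in the example are made up for illustration.

```python
def species_in_maf(lines):
    """Collect species prefixes from MAF 's' lines.

    Assumes the src field (second column) looks like 'species.seqname'.
    """
    species = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 7 and fields[0] == "s":
            species.add(fields[1].split(".", 1)[0])
    return species

def species_in_table(lines):
    """Collect the first column of a genomes.tbl-style listing."""
    return {line.split()[0] for line in lines if line.strip()}

# Hypothetical MAF block, not taken from the real alignment.
maf_example = """##maf version=1
a
s HmelRef.chr1 0 10 + 100 ACGTACGTAC
s HeraRef.scaf7 5 10 + 200 ACGTACGTAA
s Dple.contig3 2 10 - 150 ACGTTCGTAC
""".splitlines()

# Hypothetical table that deliberately lacks Dple, to show a mismatch.
tbl_example = """HmelRef /path/to/HmelRef.fa
HeraRef /path/to/HeraRef.fa
""".splitlines()

maf_species = species_in_maf(maf_example)
tbl_species = species_in_table(tbl_example)
missing = sorted(maf_species - tbl_species)
print("in MAF but not in table:", missing)
```

With the real files, the same two functions could be fed `open(path)` handles for a MAF chunk and for genomes.tbl; an empty difference would point away from a simple naming mismatch.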

No one in our group can figure out what the heck is going on here. Any ideas?

adamfreedman, Nov 20 '20 14:11

I can't tell the cause from the data you sent. Are these two species in ../butterflies_4spec.db?

You could check the presence of these two genomes with

sqlite3 -header ../butterflies_4spec.db "
SELECT speciesname,
sum(end-start+1) AS 'genome length',
count(distinct seqnr) AS '# seqs'
FROM genomes natural join speciesnames
GROUP BY speciesname;"

Otherwise, if you put example data somewhere for us to reproduce the error, I could likely tell you what it is and improve the error message as well.

MarioStanke, Nov 22 '20 18:11

Thanks for the reply.

I ran the command you suggested (with the -column argument added, per the tutorial docs):

sqlite3 -header -column butterflies_4spec.db "
SELECT speciesname,
sum(end-start+1) AS 'genome length',
count(*) AS '# chunks',
count(distinct seqnr) AS '# seqs'
FROM genomes natural join speciesnames
GROUP BY speciesname;"

This produces the following table:

speciesname  genome length  # chunks  # seqs
Bmor         456867856      13371     4783
Dple         246770625      6219      1708
HeraRef      382828983      7756      195
HmelRef      275198219      6016      794

As you can see, all 4 of the species are present in the database.
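Beyond presence, the database contents can be cross-checked against the halStats output quoted earlier. The sketch below hard-codes the numbers from both tables in this thread and compares genome length and sequence count per species; the dictionary names are only illustrative.

```python
# Leaf genomes from the halStats output: name -> (Length, NumSequences)
hal_stats = {
    "HmelRef": (275198219, 794),
    "HeraRef": (382828983, 195),
    "Dple": (246770625, 1708),
    "Bmor": (456867856, 4783),
}

# Rows from the sqlite3 query: name -> (genome length, # seqs)
db_stats = {
    "Bmor": (456867856, 4783),
    "Dple": (246770625, 1708),
    "HeraRef": (382828983, 195),
    "HmelRef": (275198219, 794),
}

# Keep only species whose length or sequence count differs (or is absent).
mismatches = {
    name: (hal_stats[name], db_stats.get(name))
    for name in hal_stats
    if hal_stats[name] != db_stats.get(name)
}
print(mismatches)  # → {}
```

Both sources agree on length and number of sequences for all four species, so the database appears to hold the same genomes that went into the HAL file.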

The genomes and Newick tree are available on Google Drive at: https://drive.google.com/file/d/1KZv_ea4tnIai1Kds89kakTIejLiWBm6r/view?usp=sharing

Thanks in advance for any insight you can provide re the problem!

adamfreedman, Nov 23 '20 16:11